Load Balancing Policies distribute your requests across multiple models based on weights you define. Perfect for A/B testing, gradual rollouts, and resource optimization.

How It Works

  1. You assign weights to each model (e.g., 70%, 20%, 10%)
  2. Each incoming request is consistently routed to one model based on the distribution
  3. Requests with the same trace_id or user_id always go to the same model (consistency guaranteed)
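To make this concrete, here is a rough illustration (not Requesty's actual implementation; the production router uses xxhash, while this sketch uses SHA-256 from the standard library) of how hashing a trace_id deterministically selects one model from a weighted set:
import hashlib

# Example weights (any positive numbers work; they are treated as shares of the total)
MODELS = [
    ("anthropic/claude-sonnet-4-5", 70),
    ("openai/gpt-5.2", 20),
    ("google/gemini-2.5-pro", 10),
]

def pick_model(trace_id: str) -> str:
    # Hash the trace_id into a stable bucket in [0, total_weight)
    total = sum(weight for _, weight in MODELS)
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % total
    # Walk the cumulative weights until the bucket lands inside a slot
    cumulative = 0
    for model, weight in MODELS:
        cumulative += weight
        if bucket < cumulative:
            return model
    return MODELS[-1][0]

# The same trace_id always maps to the same model
assert pick_model("user-12345") == pick_model("user-12345")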

Benefits

  • A/B Testing - Compare model performance with real traffic
  • Gradual Rollouts - Send 10% to new model, 90% to stable model
  • Cost Optimization - Route most traffic to cheaper models
  • Consistent Experiences - Same user always gets same model (maintains conversation context)
  • Policy Rollouts - Load balance between entire routing policies, not just models

Creating a Load Balancing Policy

Step 1: Create the Policy

  1. Go to Routing Policies
  2. Click "Create Policy"
  3. Select "Load Balancing" as the policy type

Step 2: Configure Weights

Example Setup:
  • Policy Name: sonnet-distribution
  • Load Balancing:
    • anthropic/claude-sonnet-4-5: 50% (weight: 50)
    • bedrock/claude-sonnet-4-5@eu-central-1: 50% (weight: 50)
The weights don't have to total exactly 100 - you can enter any positive numbers and they're normalized into percentage shares.
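For example, weights of 3 and 1 would be served as a 75/25 split; a quick sketch of the normalization arithmetic (the model names here are placeholders):
# Arbitrary positive weights are normalized into percentage shares
weights = {"model-a": 3, "model-b": 1}  # placeholder names
total = sum(weights.values())
shares = {model: 100 * weight / total for model, weight in weights.items()}
print(shares)  # {'model-a': 75.0, 'model-b': 25.0}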

Step 3: Use the Policy in Your Code

After creating the policy, reference it with policy/your-policy-name:
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/sonnet-distribution",  # ← Your load balancing policy
    messages=[{"role": "user", "content": "Hello!"}]
)

Consistency Guarantee

Load balancing uses deterministic hashing to ensure the same user always gets the same model:
  • With trace_id: All requests with the same trace_id route to the same model
  • Without trace_id: Requesty generates a unique request_id per request, so each request is routed independently and there is no cross-request stickiness
This means:
  • ✅ Multi-turn conversations stay on the same model (preserves context)
  • ✅ User sessions get consistent behavior
  • ✅ A/B test groups are stable

Maintaining Consistency Across Requests

To keep a user on the same model across multiple requests, pass a trace_id:
# All requests with the same trace_id go to the same model
response = client.chat.completions.create(
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "requesty": {
            "trace_id": "user-12345"  # ← Same user, same model
        }
    }
)
Pro Tip: Use your internal user ID as the trace_id to ensure each user gets a consistent model experience while still benefiting from A/B testing.
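A minimal sketch of that pattern, reusing the client from Step 3 (the chat_for_user helper is illustrative, not part of any SDK):
def chat_for_user(user_id: str, messages: list) -> str:
    # Routing is keyed on trace_id, so passing the internal user ID keeps
    # every request from this user on the same model.
    response = client.chat.completions.create(
        model="policy/sonnet-distribution",
        messages=messages,
        extra_body={"requesty": {"trace_id": user_id}},
    )
    return response.choices[0].message.content

print(chat_for_user("user-12345", [{"role": "user", "content": "Hello!"}]))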

Load Balancing Between Policies

You can load balance between entire routing policies, not just individual models. This is powerful for:
  • Canary deployments of policy changes
  • A/B testing different routing strategies
  • Gradual migration from one policy to another

Example: Policy Rollout

Let's say you have two fallback policies.
Policy A (stable):
policy/production-fallback
├─ openai/gpt-5.2
└─ anthropic/claude-sonnet-4-5
Policy B (experimental):
policy/experimental-fallback
├─ google/gemini-2.5-pro
└─ openai/gpt-5.2
Create a load balancing policy to send 20% to experimental, 80% to stable:
policy/gradual-rollout (Load Balancing)
├─ policy/production-fallback: 80%
└─ policy/experimental-fallback: 20%
Now use policy/gradual-rollout in your code. As you gain confidence, adjust the weights to 50/50, then 0/100.
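The application code stays the same throughout the rollout; it only ever references the load balancing policy (a sketch reusing the client from Step 3):
response = client.chat.completions.create(
    model="policy/gradual-rollout",  # 80% production-fallback, 20% experimental-fallback
    messages=[{"role": "user", "content": "Hello!"}]
)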
When load balancing between policies, each policy must be compatible with your request parameters. For example, don't mix embedding policies with chat completion policies.

Use Cases

A/B Testing New Models

Compare GPT-5.2 vs Gemini 2.5 Pro on real traffic:
Policy: ab-test-frontier
├─ openai/gpt-5.2: 50%
└─ google/gemini-2.5-pro: 50%
Track performance in Analytics and see which model performs better.

Gradual Model Rollout

Carefully introduce a new model:
Policy: careful-rollout
├─ openai/gpt-4o: 90%  ← Stable, proven
└─ openai/gpt-5.2: 10%  ← New, testing
Increase the weight of gpt-5.2 as you validate quality.

Cost-Optimized Distribution

Route most traffic to cheaper models, some to premium:
Policy: cost-optimized
├─ openai/gpt-4o-mini: 70%
├─ openai/gpt-4o: 20%
└─ openai/gpt-5.2: 10%

Multi-Provider Redundancy

Distribute across providers for resilience:
Policy: multi-provider
├─ openai/gpt-5.2: 40%
├─ anthropic/claude-sonnet-4-5: 40%
└─ google/gemini-2.5-pro: 20%

Key Selection (BYOK)

For each model in your load balancing policy, you can choose:
  • Requesty provided key - Use Requesty's managed keys (default)
  • My own key - Use your BYOK credentials

Monitoring & Analytics

Track your load balancing performance:
  1. Go to Analytics
  2. Filter by your policy name
  3. See the actual distribution of requests across models
  4. Compare latency, cost, and success rates between models
The distribution should match your configured weights (±2% variance is normal due to caching).
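You can also sanity-check the split from the client side by counting which model served each response; this sketch assumes the model field returned on each response names the model that actually handled it:
from collections import Counter

seen = Counter()
for i in range(200):
    # Distinct trace_ids so each request is routed independently
    response = client.chat.completions.create(
        model="policy/sonnet-distribution",
        messages=[{"role": "user", "content": "ping"}],
        extra_body={"requesty": {"trace_id": f"sample-{i}"}},
    )
    seen[response.model] += 1  # assumes response.model reflects the serving model

for model, count in seen.most_common():
    print(f"{model}: {100 * count / 200:.1f}%")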

FAQ

How does Requesty pick a model for each request?
Requesty runs the xxhash algorithm on your trace_id (or on the generated request_id if no trace_id is provided) to deterministically select a model. The same ID always produces the same hash, which maps to the same model.

What happens if I change the weights?
Changing weights re-distributes traffic, so some users may switch to a different model. If you need stability, avoid changing weights frequently, or use separate policies for stable and experimental traffic.

Can I combine load balancing with fallbacks?
Yes! Create a load balancing policy that points to fallback policies:
policy/lb-with-fallback (Load Balancing)
├─ policy/openai-fallback: 50%
└─ policy/anthropic-fallback: 50%
This gives you both load balancing AND automatic failover.

Do the models in a policy need to be compatible?
Yes. All models in a load balancing policy should support the same request format and features. Don't mix chat models with embedding models, or models with different context lengths.

How do I keep A/B test groups stable and representative?
Use a stable trace_id (such as a user ID). With 100+ unique users, the observed distribution converges to your configured weights (e.g., a 20% slot receives roughly 20% of users). With small sample sizes, expect around ±5% variance.