Load Balancing Policies distribute your requests across multiple models based on weights you define. Perfect for A/B testing, gradual rollouts, and resource optimization.

How It Works

  1. You assign weights to each model (e.g., 70%, 20%, 10%)
  2. Each incoming request is consistently routed to one model based on the distribution
  3. Requests with the same trace_id or user_id always go to the same model (consistency guaranteed)
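To make this concrete, here is a rough illustration (not Requesty's actual implementation; the production router uses xxhash, while this sketch uses SHA-256 from the standard library) of how hashing a trace_id deterministically selects one model from a weighted set:
import hashlib

# Example weights (any positive numbers work; they are treated as shares of the total)
MODELS = [
    ("anthropic/claude-sonnet-4-5", 70),
    ("openai/gpt-5.2", 20),
    ("google/gemini-2.5-pro", 10),
]

def pick_model(trace_id: str) -> str:
    # Hash the trace_id into a stable bucket in [0, total_weight)
    total = sum(weight for _, weight in MODELS)
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % total
    # Walk the cumulative weights until the bucket lands inside a slot
    cumulative = 0
    for model, weight in MODELS:
        cumulative += weight
        if bucket < cumulative:
            return model
    return MODELS[-1][0]

# The same trace_id always maps to the same model
assert pick_model("user-12345") == pick_model("user-12345")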

Benefits

  • A/B Testing - Compare model performance with real traffic
  • Gradual Rollouts - Send 10% to new model, 90% to stable model
  • Cost Optimization - Route most traffic to cheaper models
  • Consistent Experiences - Same user always gets same model (maintains conversation context)
  • Policy Rollouts - Load balance between entire routing policies, not just models

Creating a Load Balancing Policy

Step 1: Create the Policy

  1. Go to Routing Policies
  2. Click "Create Policy"
  3. Select "Load Balancing" as the policy type

Step 2: Configure Weights

Example Setup:
  • Policy Name: sonnet-distribution
  • Load Balancing:
    • anthropic/claude-sonnet-4-5: 50% (weight: 50)
    • bedrock/claude-sonnet-4-5@eu-central-1: 50% (weight: 50)
The weights don't have to total exactly 100 - you can enter any positive numbers and they're normalized into percentage shares.
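For example, weights of 3 and 1 would be served as a 75/25 split; a quick sketch of the normalization arithmetic (the model names here are placeholders):
# Arbitrary positive weights are normalized into percentage shares
weights = {"model-a": 3, "model-b": 1}  # placeholder names
total = sum(weights.values())
shares = {model: 100 * weight / total for model, weight in weights.items()}
print(shares)  # {'model-a': 75.0, 'model-b': 25.0}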

Step 3: Use the Policy in Your Code

After creating the policy, reference it with policy/your-policy-name:
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/sonnet-distribution",  # ← Your load balancing policy
    messages=[{"role": "user", "content": "Hello!"}]
)

Consistency Guarantee

Load balancing uses deterministic hashing to ensure the same user always gets the same model:
  • With trace_id: All requests with the same trace_id route to the same model
  • Without trace_id: Requesty generates a unique request_id per request, so each request is routed independently and there is no cross-request stickiness
This means:
  • ✅ Multi-turn conversations stay on the same model (preserves context)
  • ✅ User sessions get consistent behavior
  • ✅ A/B test groups are stable

Maintaining Consistency Across Requests

To keep a user on the same model across multiple requests, pass a trace_id:
# All requests with the same trace_id go to the same model
response = client.chat.completions.create(
    model="policy/sonnet-distribution",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "requesty": {
            "trace_id": "user-12345"  # ← Same user, same model
        }
    }
)
Pro Tip: Use your internal user ID as the trace_id to ensure each user gets a consistent model experience while still benefiting from A/B testing.
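A minimal sketch of that pattern, reusing the client from Step 3 (the chat_for_user helper is illustrative, not part of any SDK):
def chat_for_user(user_id: str, messages: list) -> str:
    # Routing is keyed on trace_id, so passing the internal user ID keeps
    # every request from this user on the same model.
    response = client.chat.completions.create(
        model="policy/sonnet-distribution",
        messages=messages,
        extra_body={"requesty": {"trace_id": user_id}},
    )
    return response.choices[0].message.content

print(chat_for_user("user-12345", [{"role": "user", "content": "Hello!"}]))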

Load Balancing Between Policies

You can load balance between entire routing policies, not just individual models. This is powerful for:
  • Canary deployments of policy changes
  • A/B testing different routing strategies
  • Gradual migration from one policy to another

Example: Policy Rollout

Let's say you have two fallback policies.
Policy A (stable):
policy/production-fallback
├─ openai/gpt-5.2
└─ anthropic/claude-sonnet-4-5
Policy B (experimental):
policy/experimental-fallback
├─ google/gemini-2.5-pro
└─ openai/gpt-5.2
Create a load balancing policy to send 20% to experimental, 80% to stable:
policy/gradual-rollout (Load Balancing)
├─ policy/production-fallback: 80%
└─ policy/experimental-fallback: 20%
Now use policy/gradual-rollout in your code. As you gain confidence, adjust the weights to 50/50, then 0/100.
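The application code stays the same throughout the rollout; it only ever references the load balancing policy (a sketch reusing the client from Step 3):
response = client.chat.completions.create(
    model="policy/gradual-rollout",  # 80% production-fallback, 20% experimental-fallback
    messages=[{"role": "user", "content": "Hello!"}]
)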
When load balancing between policies, each policy must be compatible with your request parameters. For example, don't mix embedding policies with chat completion policies.

Use Cases

A/B Testing New Models

Compare GPT-5.2 vs Gemini 2.5 Pro on real traffic:
Policy: ab-test-frontier
├─ openai/gpt-5.2: 50%
└─ google/gemini-2.5-pro: 50%
Track performance in Analytics and see which model performs better.

Gradual Model Rollout

Carefully introduce a new model:
Policy: careful-rollout
├─ openai/gpt-4o: 90%  ← Stable, proven
└─ openai/gpt-5.2: 10%  ← New, testing
Increase the weight of gpt-5.2 as you validate quality.

Cost-Optimized Distribution

Route most traffic to cheaper models, some to premium:
Policy: cost-optimized
├─ openai/gpt-4o-mini: 70%
├─ openai/gpt-4o: 20%
└─ openai/gpt-5.2: 10%

Multi-Provider Redundancy

Distribute across providers for resilience:
Policy: multi-provider
├─ openai/gpt-5.2: 40%
├─ anthropic/claude-sonnet-4-5: 40%
└─ google/gemini-2.5-pro: 20%

Key Selection (BYOK)

For each model in your load balancing policy, you can choose:
  • Requesty provided key - Use Requesty's managed keys (default)
  • My own key - Use your BYOK credentials

Monitoring & Analytics

Track your load balancing performance:
  1. Go to Analytics
  2. Filter by your policy name
  3. See the actual distribution of requests across models
  4. Compare latency, cost, and success rates between models
The distribution should match your configured weights (±2% variance is normal due to caching).
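You can also sanity-check the split from the client side by counting which model served each response; this sketch assumes the model field returned on each response names the model that actually handled it:
from collections import Counter

seen = Counter()
for i in range(200):
    # Distinct trace_ids so each request is routed independently
    response = client.chat.completions.create(
        model="policy/sonnet-distribution",
        messages=[{"role": "user", "content": "ping"}],
        extra_body={"requesty": {"trace_id": f"sample-{i}"}},
    )
    seen[response.model] += 1  # assumes response.model reflects the serving model

for model, count in seen.most_common():
    print(f"{model}: {100 * count / 200:.1f}%")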

FAQ

How does Requesty pick a model for each request?
Requesty runs the xxhash algorithm on your trace_id (or on the generated request_id if no trace_id is provided) to deterministically select a model. The same ID always produces the same hash, which maps to the same model.

What happens if I change the weights?
Changing weights re-distributes traffic, so some users may switch to a different model. If you need stability, avoid changing weights frequently, or use separate policies for stable and experimental traffic.

Can I combine load balancing with fallbacks?
Yes! Create a load balancing policy that points to fallback policies:
policy/lb-with-fallback (Load Balancing)
├─ policy/openai-fallback: 50%
└─ policy/anthropic-fallback: 50%
This gives you both load balancing AND automatic failover.

Do the models in a policy need to be compatible?
Yes. All models in a load balancing policy should support the same request format and features. Don't mix chat models with embedding models, or models with different context lengths.

How do I keep A/B test groups stable and representative?
Use a stable trace_id (such as a user ID). With 100+ unique users, the observed distribution converges to your configured weights (e.g., a 20% slot receives roughly 20% of users). With small sample sizes, expect around ±5% variance.