Latency-Based Routing automatically selects the fastest model for each request based on real-time performance data. Requesty continuously monitors response times and routes to the lowest-latency option.

How It Works

  1. Requesty tracks latency for every model in your policy
  2. When a request arrives, the router sorts models by speed (fastest first)
  3. Your request goes to the currently fastest model
  4. Latency data updates in real-time based on recent performance
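In code terms, steps 2-3 reduce to picking the model with the lowest recent latency. A minimal conceptual sketch (the latency map below is hypothetical illustration data, not a Requesty API):

recent_latency_ms = {
    "anthropic/claude-sonnet-4-5": 480.0,
    "bedrock/claude-sonnet-4-5@eu-central-1": 350.0,
}

def pick_fastest(models):
    # Models with no recent data sort last (treated as infinite latency).
    return min(models, key=lambda m: recent_latency_ms.get(m, float("inf")))

pick_fastest(list(recent_latency_ms))  # -> "bedrock/claude-sonnet-4-5@eu-central-1"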

Benefits

  • Fastest responses - Always use the quickest model available
  • Automatic adaptation - Router adjusts when model performance changes
  • No manual tuning - Latency optimization happens automatically
  • Regional optimization - Automatically prefer nearby endpoints

Creating a Latency-Based Policy

Step 1: Create the Policy

  1. Go to Routing Policies
  2. Click “Create Policy”
  3. Select “Latency” as the policy type
[Screenshot: Latency Routing Policy]

Step 2: Select Models

Example Setup:
  • Policy Name: fastest-sonnet
  • Models:
    • anthropic/claude-sonnet-4-5
    • bedrock/claude-sonnet-4-5@eu-central-1
The router will automatically choose whichever is faster at request time.

Step 3: Use the Policy in Your Code

Reference the policy in your model parameter:
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/fastest-sonnet",  # ← Automatically uses fastest model
    messages=[{"role": "user", "content": "Hello!"}]
)

How Latency Tracking Works

Requesty measures time-to-first-token (TTFT) for streaming requests and total response time for non-streaming:
  • Streaming: Time from request sent → first token received
  • Non-streaming: Time from request sent → complete response received
Latency data is:
  • Per-model - Each model tracked independently
  • Rolling window - Based on recent requests (last ~1 hour)
  • Organization-scoped - Your traffic patterns, not global averages
Cold Start Behavior: Models with no recent latency data are tried occasionally to gather performance metrics. After 5-10 requests, the router has enough data for optimal routing.
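Requesty measures these timings on the router side, but you can sanity-check time-to-first-token from your own client with a streaming request. A rough sketch using the same OpenAI-compatible setup as Step 3 (policy name and API key are placeholders):

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="policy/fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time from request sent to first token
print(f"TTFT: {ttft * 1000:.0f} ms" if ttft else "no tokens received")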

Key Selection Strategies

For each model, you can configure which API key to try first:

Requesty Provided Key (Default)

Use Requesty’s managed keys only.

My Own Key

Use your BYOK credentials only.

Requesty First, Then BYOK

Try Requesty’s key first. If it’s slower or unavailable, try your BYOK.

BYOK First, Then Requesty

Try your BYOK first. If it’s slower or unavailable, try Requesty’s key.

Example: If anthropic/claude-sonnet-4-5 with the Requesty key is faster than with BYOK, the policy will automatically prefer Requesty’s key.

Use Cases

Regional Optimization

Route to the fastest regional endpoint:
Policy: regional-claude
├─ anthropic/claude-sonnet-4-5 (global)
├─ bedrock/claude-sonnet-4-5@us-east-1
├─ bedrock/claude-sonnet-4-5@eu-central-1
└─ bedrock/claude-sonnet-4-5@ap-southeast-1
Users in Europe automatically get eu-central-1, users in Asia get ap-southeast-1.

Provider Performance

Let the router pick the fastest provider:
Policy: fastest-frontier
├─ openai/gpt-5.2
├─ anthropic/claude-sonnet-4-5
└─ google/gemini-2.5-pro
If OpenAI is experiencing slowdowns, traffic shifts to Anthropic or Google automatically.

Cost + Speed Optimization

Combine similar-priced models and route to fastest:
Policy: fast-and-cheap
├─ openai/gpt-4o-mini
├─ anthropic/claude-3-5-haiku
└─ google/gemini-1.5-flash
All three are low-cost. Requesty picks whichever responds fastest.

Combining with Other Policies

Latency routing works great with load balancing and fallback:

Latency + Load Balancing

Policy: lb-to-latency (Load Balancing)
├─ policy/fastest-openai: 50%
└─ policy/fastest-anthropic: 50%
Each sub-policy uses latency routing, while the parent policy splits traffic 50/50 for A/B testing.

Latency + Fallback

Policy: fast-with-fallback (Fallback)
├─ policy/fastest-frontier  ← Latency-based
└─ openai/gpt-4o           ← Stable fallback
Try the latency-optimized policy first; fall back to the known-good model if all of its models fail.

Monitoring Latency

Track which models are fastest for your traffic:
  1. Go to Performance Monitoring
  2. View time-to-first-token and total latency by model
  3. See how latency routing distributes traffic
You’ll see traffic automatically shift to faster models over time.

FAQ

What happens to models with no latency data?
Models without recent data are assigned max latency (infinite). They’ll be tried occasionally (~5-10% of traffic) to gather data. Once they have metrics, they compete fairly.

Does latency routing consider cost?
No. Latency routing only considers speed. If you want cost optimization, use load balancing to prefer cheaper models, or manually order a fallback chain by price.

Can I force a specific model for some requests?
Yes. Instead of using the latency policy, pass a direct model name (e.g., openai/gpt-5.2) for requests where you need a specific model.

How often is latency data updated?
Continuously. Latency metrics are updated after every request. The router uses a rolling average of recent requests (last ~1 hour) to smooth out spikes.

What happens if the fastest model fails?
Latency routing tries models in speed order. If the fastest model fails, it tries the second-fastest, and so on. This differs from fallback policies, where the order is manually configured.

Can I see which model handled a request?
Yes! Check the response headers or request logs in Analytics. You’ll see which model handled each request.
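In the Python client, one way to check this per request is to read the raw response. The exact header carrying the routing decision isn’t specified here, so this sketch simply dumps all headers and also prints the model field from the OpenAI-compatible response body:

from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

raw = client.chat.completions.with_raw_response.create(
    model="policy/fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(dict(raw.headers))   # inspect for the routing headers mentioned above
completion = raw.parse()   # regular ChatCompletion object
print(completion.model)    # model reported in the response body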

Technical Details

Latency Calculation

Latency Score = Weighted Average([recent requests])
- Streaming: Time to first token
- Non-streaming: Total response time
- Window: Last ~100 requests or 60 minutes

Sorting Algorithm

1. Fetch latency data for all models in policy
2. Sort models by latency (ascending)
3. Models without data → bottom of list
4. Route to lowest latency model
5. If that fails, try next lowest latency
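A compact Python sketch of this selection loop. The sample data and the use of a plain mean as the “weighted average” are assumptions for illustration; the send callable stands in for the actual dispatch:

from statistics import mean

# Hypothetical recent latency samples per model, in milliseconds.
samples = {
    "openai/gpt-5.2": [310, 295, 330],
    "anthropic/claude-sonnet-4-5": [280, 300],
    "google/gemini-2.5-pro": [],  # no recent data -> sorted to the bottom
}

def latency_score(model):
    s = samples.get(model, [])
    return mean(s) if s else float("inf")  # step 3: models without data go last

def route(models, send):
    # Steps 1-2: sort all policy models by latency score, ascending.
    for model in sorted(models, key=latency_score):
        try:
            return send(model)  # step 4: route to the lowest-latency model
        except Exception:
            continue            # step 5: on failure, try the next lowest latency
    raise RuntimeError("all models in the policy failed")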

Consistency

Unlike load balancing, latency routing does not guarantee the same user gets the same model. If your use case requires consistency, use load balancing with trace_id instead.