Latency-Based Routing automatically selects the fastest model for each request based on real-time performance data. Requesty continuously monitors response times and routes to the lowest-latency option.

How It Works

  1. Requesty tracks latency for every model in your policy
  2. When a request arrives, the router sorts models by speed (fastest first)
  3. Your request goes to the currently fastest model
  4. Latency data updates in real-time based on recent performance
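In code terms, steps 2-3 reduce to picking the model with the lowest recent latency. A minimal conceptual sketch (the latency map below is hypothetical illustration data, not a Requesty API):

recent_latency_ms = {
    "anthropic/claude-sonnet-4-5": 480.0,
    "bedrock/claude-sonnet-4-5@eu-central-1": 350.0,
}

def pick_fastest(models):
    # Models with no recent data sort last (treated as infinite latency).
    return min(models, key=lambda m: recent_latency_ms.get(m, float("inf")))

pick_fastest(list(recent_latency_ms))  # -> "bedrock/claude-sonnet-4-5@eu-central-1"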

Benefits

  • Fastest responses - Always use the quickest model available
  • Automatic adaptation - Router adjusts when model performance changes
  • No manual tuning - Latency optimization happens automatically
  • Regional optimization - Automatically prefer nearby endpoints

Creating a Latency-Based Policy

Step 1: Create the Policy

  1. Go to Routing Policies
  2. Click “Create Policy”
  3. Select “Latency” as the policy type
[Screenshot: Latency Routing Policy]

Step 2: Select Models

Example Setup:
  • Policy Name: fastest-sonnet
  • Models:
    • anthropic/claude-sonnet-4-5
    • bedrock/claude-sonnet-4-5@eu-central-1
The router will automatically choose whichever is faster at request time.

Step 3: Use the Policy in Your Code

Reference the policy in your model parameter:
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

response = client.chat.completions.create(
    model="policy/fastest-sonnet",  # ← Automatically uses fastest model
    messages=[{"role": "user", "content": "Hello!"}]
)

How Latency Tracking Works

Requesty measures time-to-first-token (TTFT) for streaming requests and total response time for non-streaming:
  • Streaming: Time from request sent → first token received
  • Non-streaming: Time from request sent → complete response received
Latency data is:
  • Per-model - Each model tracked independently
  • Rolling window - Based on recent requests (last ~1 hour)
  • Organization-scoped - Your traffic patterns, not global averages
Cold Start Behavior: Models with no recent latency data are tried occasionally to gather performance metrics. After 5-10 requests, the router has enough data for optimal routing.
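Requesty measures these timings on the router side, but you can sanity-check time-to-first-token from your own client with a streaming request. A rough sketch using the same OpenAI-compatible setup as Step 3 (policy name and API key are placeholders):

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="policy/fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time from request sent to first token
print(f"TTFT: {ttft * 1000:.0f} ms" if ttft else "no tokens received")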

Key Selection Strategies

For each model, you can configure which API key to try first:

Requesty Provided Key (Default)

Use Requesty’s managed keys only.

My Own Key

Use your BYOK credentials only.

Requesty First, Then BYOK

Try Requesty’s key first. If it’s slower or unavailable, try your BYOK.

BYOK First, Then Requesty

Try your BYOK first. If it’s slower or unavailable, try Requesty’s key.

Example: If anthropic/claude-sonnet-4-5 with the Requesty key is faster than with BYOK, the policy will automatically prefer Requesty’s key.

Use Cases

Regional Optimization

Route to the fastest regional endpoint:
Policy: regional-claude
├─ anthropic/claude-sonnet-4-5 (global)
├─ bedrock/claude-sonnet-4-5@us-east-1
├─ bedrock/claude-sonnet-4-5@eu-central-1
└─ bedrock/claude-sonnet-4-5@ap-southeast-1
Users in Europe automatically get eu-central-1, users in Asia get ap-southeast-1.

Provider Performance

Let the router pick the fastest provider:
Policy: fastest-frontier
├─ openai/gpt-5.2
├─ anthropic/claude-sonnet-4-5
└─ google/gemini-2.5-pro
If OpenAI is experiencing slowdowns, traffic shifts to Anthropic or Google automatically.

Cost + Speed Optimization

Combine similar-priced models and route to fastest:
Policy: fast-and-cheap
├─ openai/gpt-4o-mini
├─ anthropic/claude-3-5-haiku
└─ google/gemini-1.5-flash
All three are low-cost. Requesty picks whichever responds fastest.

Combining with Other Policies

Latency routing works great with load balancing and fallback:

Latency + Load Balancing

Policy: lb-to-latency (Load Balancing)
├─ policy/fastest-openai: 50%
└─ policy/fastest-anthropic: 50%
Each sub-policy uses latency routing, while the parent policy splits traffic 50/50 for A/B testing.

Latency + Fallback

Policy: fast-with-fallback (Fallback)
├─ policy/fastest-frontier  ← Latency-based
└─ openai/gpt-4o           ← Stable fallback
Try the latency-optimized policy first; fall back to the known-good model if all of its models fail.

Monitoring Latency

Track which models are fastest for your traffic:
  1. Go to Performance Monitoring
  2. View time-to-first-token and total latency by model
  3. See how latency routing distributes traffic
You’ll see traffic automatically shift to faster models over time.

FAQ

What happens to models with no latency data?
Models without recent data are assigned max latency (infinite). They’ll be tried occasionally (~5-10% of traffic) to gather data. Once they have metrics, they compete fairly.

Does latency routing consider cost?
No. Latency routing only considers speed. If you want cost optimization, use load balancing to prefer cheaper models, or manually order a fallback chain by price.

Can I force a specific model for some requests?
Yes. Instead of using the latency policy, pass a direct model name (e.g., openai/gpt-5.2) for requests where you need a specific model.

How often is latency data updated?
Continuously. Latency metrics are updated after every request. The router uses a rolling average of recent requests (last ~1 hour) to smooth out spikes.

What happens if the fastest model fails?
Latency routing tries models in speed order. If the fastest model fails, it tries the second-fastest, and so on. This differs from fallback policies, where the order is manually configured.

Can I see which model handled a request?
Yes! Check the response headers or request logs in Analytics. You’ll see which model handled each request.
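In the Python client, one way to check this per request is to read the raw response. The exact header carrying the routing decision isn’t specified here, so this sketch simply dumps all headers and also prints the model field from the OpenAI-compatible response body:

from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key"
)

raw = client.chat.completions.with_raw_response.create(
    model="policy/fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(dict(raw.headers))   # inspect for the routing headers mentioned above
completion = raw.parse()   # regular ChatCompletion object
print(completion.model)    # model reported in the response body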

Technical Details

Latency Calculation

Latency Score = Weighted Average([recent requests])
- Streaming: Time to first token
- Non-streaming: Total response time
- Window: Last ~100 requests or 60 minutes

Sorting Algorithm

1. Fetch latency data for all models in policy
2. Sort models by latency (ascending)
3. Models without data → bottom of list
4. Route to lowest latency model
5. If that fails, try next lowest latency
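A compact Python sketch of this selection loop. The sample data and the use of a plain mean as the “weighted average” are assumptions for illustration; the send callable stands in for the actual dispatch:

from statistics import mean

# Hypothetical recent latency samples per model, in milliseconds.
samples = {
    "openai/gpt-5.2": [310, 295, 330],
    "anthropic/claude-sonnet-4-5": [280, 300],
    "google/gemini-2.5-pro": [],  # no recent data -> sorted to the bottom
}

def latency_score(model):
    s = samples.get(model, [])
    return mean(s) if s else float("inf")  # step 3: models without data go last

def route(models, send):
    # Steps 1-2: sort all policy models by latency score, ascending.
    for model in sorted(models, key=latency_score):
        try:
            return send(model)  # step 4: route to the lowest-latency model
        except Exception:
            continue            # step 5: on failure, try the next lowest latency
    raise RuntimeError("all models in the policy failed")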

Consistency

Unlike load balancing, latency routing does not guarantee the same user gets the same model. If your use case requires consistency, use load balancing with trace_id instead.