How It Works
- Requesty tracks latency for every model in your policy
- When a request arrives, the router sorts models by speed (fastest first)
- Your request goes to the currently fastest model
- Latency data updates in real-time based on recent performance
Benefits
- Fastest responses - Always use the quickest model available
- Automatic adaptation - Router adjusts when model performance changes
- No manual tuning - Latency optimization happens automatically
- Regional optimization - Automatically prefer nearby endpoints
Creating a Latency-Based Policy
Step 1: Create the Policy
- Go to Routing Policies
- Click “Create Policy”
- Select “Latency” as the policy type

Step 2: Select Models
Example Setup:
- Policy Name: fastest-sonnet
- Models:
  - anthropic/claude-sonnet-4-5
  - bedrock/claude-sonnet-4-5@eu-central-1
Step 3: Use the Policy in Your Code
Reference the policy in your model parameter:
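For example, with the OpenAI-compatible SDK (a minimal sketch; the base URL and environment variable name are assumptions, so use the values from your dashboard):

```python
import os
from openai import OpenAI

# Point the OpenAI SDK at Requesty's OpenAI-compatible endpoint.
# Base URL and env var name are assumptions; check your dashboard.
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key=os.environ["REQUESTY_API_KEY"],
)

response = client.chat.completions.create(
    model="fastest-sonnet",  # the latency policy name from Step 2
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```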
How Latency Tracking Works
Requesty measures time-to-first-token (TTFT) for streaming requests and total response time for non-streaming requests (a client-side sketch follows this list):
- Streaming: Time from request sent → first token received
- Non-streaming: Time from request sent → complete response received
- Per-model - Each model tracked independently
- Rolling window - Based on recent requests (last ~1 hour)
- Organization-scoped - Your traffic patterns, not global averages
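The two metric definitions can be reproduced client-side. A sketch of what each one measures, reusing the `client` from Step 3 (illustrative only; Requesty records these server-side):

```python
import time

# Streaming: time from request sent -> first token received (TTFT)
start = time.monotonic()
stream = client.chat.completions.create(
    model="fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.monotonic() - start
        break

# Non-streaming: time from request sent -> complete response received
start = time.monotonic()
client.chat.completions.create(
    model="fastest-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
)
total_latency = time.monotonic() - start
```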
Cold Start Behavior: Models with no recent latency data are tried occasionally to gather performance metrics. After 5-10 requests, the router has enough data for optimal routing.
Key Selection Strategies
For each model, you can configure which API key to try first:
- Requesty Provided Key (Default) - Use Requesty's managed keys only.
- My Own Key - Use your BYOK credentials only.
- Requesty First, Then BYOK - Try Requesty's key first. If it's slower or unavailable, try your BYOK.
- BYOK First, Then Requesty - Try your BYOK first. If it's slower or unavailable, try Requesty's key.
Example: If anthropic/claude-sonnet-4-5 with Requesty's key is faster than with BYOK, the policy will automatically prefer Requesty's key.
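In sketch form, a per-model key strategy might look like this (hypothetical field names and enum values; the dashboard is the real configuration surface, not this schema):

```python
# Hypothetical representation of the four key strategies above;
# field names and values are illustrative, not Requesty's actual schema.
policy = {
    "name": "fastest-sonnet",
    "type": "latency",
    "models": [
        {
            "model": "anthropic/claude-sonnet-4-5",
            "key_strategy": "requesty_first_then_byok",  # try Requesty's key, fall back to BYOK
        },
        {
            "model": "bedrock/claude-sonnet-4-5@eu-central-1",
            "key_strategy": "byok_only",  # "My Own Key": your credentials only
        },
    ],
}
```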
Use Cases
Regional Optimization
Route to the fastest regional endpoint: users in Europe get eu-central-1, users in Asia get ap-southeast-1.
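A sketch of the model list for such a policy (the @region suffixes follow the convention from Step 2; exact IDs may differ):

```python
# Regional variants of the same model; the router learns which endpoint is
# fastest for your traffic and prefers it automatically.
regional_models = [
    "bedrock/claude-sonnet-4-5@eu-central-1",    # typically fastest for Europe
    "bedrock/claude-sonnet-4-5@ap-southeast-1",  # typically fastest for Asia
]
```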
Provider Performance
Let the router pick the fastest provider for the same model:
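For example, listing the same model from two providers (IDs follow the Step 2 example):

```python
# Same underlying model via two providers; whichever currently has lower
# latency for your organization wins the request.
provider_models = [
    "anthropic/claude-sonnet-4-5",             # Anthropic's API directly
    "bedrock/claude-sonnet-4-5@eu-central-1",  # the same model via AWS Bedrock
]
```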
Cost + Speed Optimization
Combine similar-priced models and route to the fastest:
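A hypothetical pairing (verify the models really are similarly priced before relying on this):

```python
# Hypothetical list of similarly priced models; when cost is comparable,
# routing purely on speed is effectively free optimization.
similar_price_models = [
    "openai/gpt-5.2",              # ID taken from the FAQ example below
    "anthropic/claude-sonnet-4-5",
]
```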
Combining with Other Policies
Latency routing works great with load balancing and fallback:
Latency + Load Balancing
Latency + Fallback
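These combinations are configured server-side in the dashboard. As a purely client-side illustration of the Latency + Fallback idea, you can pin a direct model whenever the policy request fails (a sketch reusing the `client` from Step 3; openai/gpt-5.2 is the direct-model ID from the FAQ):

```python
from openai import OpenAIError

def complete_with_fallback(messages: list[dict]):
    """Try the latency policy first; pin a specific model if it fails."""
    try:
        return client.chat.completions.create(model="fastest-sonnet", messages=messages)
    except OpenAIError:
        return client.chat.completions.create(model="openai/gpt-5.2", messages=messages)
```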
Monitoring Latency
Track which models are fastest for your traffic:
- Go to Performance Monitoring
- View time-to-first-token and total latency by model
- See how latency routing distributes traffic
FAQ
What if a model has no latency data?
Models without recent data are assigned max latency (infinite). They’ll be tried occasionally (~5-10% of traffic) to gather data. Once they have metrics, they compete fairly.
Does latency routing consider cost?
No. Latency routing only considers speed. If you want cost optimization, use load balancing to prefer cheaper models, or manually order a fallback chain by price.
Can I force a specific model for some requests?
Yes. Instead of using the latency policy, pass a direct model name (e.g., openai/gpt-5.2) for requests where you need a specific model.
How often does latency data update?
Continuously. Latency metrics are updated after every request. The router uses a rolling average of recent requests (last ~1 hour) to smooth out spikes.
What happens if the fastest model fails?
Latency routing tries models in speed order. If the fastest model fails, it tries the second-fastest, and so on. This is different from fallback policies where order is manually configured.
Can I see which model was selected?
Yes! Check the response headers or request logs in Analytics. You’ll see which model handled each request.
Technical Details
Latency Calculation
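The implementation isn't published; here is a minimal sketch consistent with the behavior described above (per-model samples, rolling ~1 hour window, no recent data → infinite latency):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # rolling window of recent requests (~1 hour)
_samples: dict[str, deque] = defaultdict(deque)  # model -> (timestamp, seconds)

def record(model: str, latency: float) -> None:
    """Store a latency sample (TTFT for streaming, total time otherwise)."""
    _samples[model].append((time.time(), latency))

def rolling_latency(model: str) -> float:
    """Average latency over the window; infinity when there is no recent data."""
    window = _samples[model]
    cutoff = time.time() - WINDOW_SECONDS
    while window and window[0][0] < cutoff:
        window.popleft()  # evict samples older than the window
    if not window:
        return float("inf")  # cold start
    return sum(latency for _, latency in window) / len(window)
```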
Sorting Algorithm
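Again a sketch, building on rolling_latency above: candidates are ordered fastest-first, and cold models are occasionally promoted so they can gather data (the ~5-10% exploration rate comes from the FAQ):

```python
import math
import random

EXPLORE_RATE = 0.075  # ~5-10% of traffic samples models without data

def selection_order(models: list[str]) -> list[str]:
    """Fastest model first; a failure falls through to the next entry."""
    ranked = sorted(models, key=rolling_latency)
    cold = [m for m in ranked if math.isinf(rolling_latency(m))]
    if cold and random.random() < EXPLORE_RATE:
        return cold + [m for m in ranked if m not in cold]
    return ranked
```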
Consistency
Unlike load balancing, latency routing does not guarantee the same user gets the same model. If your use case requires consistency, use load balancing with trace_id instead.