Provider Benchmarks
Provider success/cost/duration benchmarks.
Compare LLM provider performance across cost, speed, and reliability. Provider Benchmarks shows which providers succeed most often, cost the least, and respond the fastest - helping you optimize your provider routing strategy.
Overview
When a workflow runs, it routes each agent task to an LLM provider (Anthropic, OpenAI, etc.). The Provider Benchmarks panel tracks:
- Provider name - Anthropic, OpenAI, Google Gemini, etc.
- Success rate - % of tasks that completed without error
- Avg cost - Mean LLM cost (tokens × model pricing) per task
- Avg duration - Mean wall-clock time from request to response
- Total runs - How many times this provider was used
- Cost-efficiency score - Composite metric: (success_rate × speed) / cost
Key Metrics
Success Rate
What % of tasks completed successfully with this provider?
- Data source: cost_events and session_activities (LLM-call spans)
- Calculation: Successful calls / total calls (timeouts, 5xx errors, and invalid responses count as failures)
- Typical range: 95-99.5% (modern LLM providers are very reliable)
- Red flag: <95% → Integration issue (rate limiting, credentials, malformed prompts)
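The calculation above can be sketched as follows. This is a minimal illustration, assuming each LLM-call span has been loaded into an object with a `provider` field and an `error` field that is null on success; the `LlmSpan` type and the function name are hypothetical, not part of the product API.

```typescript
// Hypothetical span shape: only `provider` and `error` mirror fields named on
// this page; everything else on the real row is omitted.
interface LlmSpan {
  provider: string;
  error: string | null; // non-null when the call failed
}

// Returns the success rate per provider as a percentage (0-100).
function successRateByProvider(spans: LlmSpan[]): Record<string, number> {
  const counts: Record<string, { ok: number; all: number }> = {};
  for (const span of spans) {
    const entry = counts[span.provider] ?? { ok: 0, all: 0 };
    entry.all += 1;
    if (span.error === null) entry.ok += 1;
    counts[span.provider] = entry;
  }
  const rates: Record<string, number> = {};
  for (const provider of Object.keys(counts)) {
    const entry = counts[provider];
    if (entry) rates[provider] = (entry.ok / entry.all) * 100;
  }
  return rates;
}
```

A provider whose value falls below 95 would match the red-flag condition described above.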
Average Cost Per Task
Mean USD cost per LLM invocation.
- Data source: cost_events.raw_cost_usd + token_count
- Composition: Input tokens × input_rate + output tokens × output_rate
- Typical cost per task:
- Anthropic Claude 3 Haiku: $0.01-0.05 (fast, cheap)
- Anthropic Claude 3 Sonnet: $0.05-0.15 (balanced)
- OpenAI GPT-4 Turbo: $0.10-0.30 (expensive, powerful)
- OpenAI GPT-4o: $0.03-0.10 (cheaper GPT-4-class model)
High cost drivers:
- Large context windows (long system prompts, memory)
- Output-heavy tasks (code generation, long reasoning)
- Expensive models (GPT-4 > Sonnet > Haiku)
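To make the composition formula concrete, here is a small sketch of the per-call cost and the per-task mean. The `TokenRates` type and the rate values used with it are assumptions for illustration; real rates come from your provider's price list.

```typescript
// Hypothetical per-token rates in USD (not actual provider pricing).
interface TokenRates {
  inputRateUsd: number;  // USD per input token
  outputRateUsd: number; // USD per output token
}

// Cost of one call: input tokens × input_rate + output tokens × output_rate.
function callCostUsd(inputTokens: number, outputTokens: number, rates: TokenRates): number {
  return inputTokens * rates.inputRateUsd + outputTokens * rates.outputRateUsd;
}

// Mean cost across calls; returns 0 when there are no calls.
function averageCostUsd(costs: number[]): number {
  if (costs.length === 0) return 0;
  let sum = 0;
  for (const cost of costs) sum += cost;
  return sum / costs.length;
}
```

Because output tokens are typically priced higher than input tokens, output-heavy tasks move the second term of the formula the most.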
Average Duration
Wall-clock time from request submission to response receipt.
- Data source: session_activities span timing (request_at → response_at)
- Typical latencies:
- Anthropic: 0.5-2.0s (fast, streaming)
- OpenAI: 1.0-3.0s (variable, depends on load)
- Google Gemini: 1.0-4.0s (slower)
- Local/On-prem: 2.0-10.0s (depends on hardware)
High latency drivers:
- Provider overload (queuing)
- Large outputs (long generation)
- Network distance (geographic latency)
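As a rough sketch of how the average duration could be derived from span timing, the snippet below assumes `request_at` and `response_at` are ISO 8601 strings; the actual storage format is not stated on this page, and the `TimedSpan` type is hypothetical.

```typescript
// Hypothetical span shape; the timestamp format is an assumption.
interface TimedSpan {
  request_at: string;  // ISO 8601, e.g. "2026-05-03T00:00:00.000Z"
  response_at: string; // ISO 8601
}

// Mean wall-clock duration in milliseconds, or null when there are no spans.
function averageDurationMs(spans: TimedSpan[]): number | null {
  if (spans.length === 0) return null;
  let totalMs = 0;
  for (const span of spans) {
    totalMs += Date.parse(span.response_at) - Date.parse(span.request_at);
  }
  return totalMs / spans.length;
}
```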
Total Runs
How many times was this provider invoked in the time window?
- Trend: Week-over-week change; used to detect migration (e.g., "switching from OpenAI to Anthropic")
- Usage pattern: Concentrated on few providers, or distributed?
Cost-Efficiency Score
Composite metric combining success, speed, and cost.
- Formula: (success_rate × speed_factor) / cost_normalized
- success_rate: 0-1 (higher is better)
- speed_factor: inverse of duration (higher is better, e.g., 1 / duration_seconds)
- cost_normalized: cost relative to cheapest provider (1 = cheapest, 2 = 2x cheapest)
- Range: 0-100 (higher is better)
- Example:
- Anthropic Haiku: 99% success, 1s latency, $0.02/task → score = 85
- OpenAI GPT-4: 99% success, 2s latency, $0.20/task → score = 25
Use this to find the "sweet spot" provider for your workflow.
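The formula can be sketched in code as below. Note the assumptions: the `ProviderStats` type and function name are hypothetical, durations and costs are taken to be positive, and the function returns the raw ratio only. How that ratio is mapped onto the 0-100 range is not specified on this page, so no scaling is applied here. With the example figures above, the raw values come out at roughly 0.99 for Haiku and 0.05 for GPT-4.

```typescript
// Hypothetical input shape for the sketch.
interface ProviderStats {
  provider: string;
  successRate: number;     // 0-1
  durationSeconds: number; // mean latency, assumed > 0
  costUsd: number;         // mean cost per task, assumed > 0
}

// Raw (unscaled) score per provider:
// (success_rate × speed_factor) / cost_normalized.
function rawEfficiencyScores(all: ProviderStats[]): Record<string, number> {
  const scores: Record<string, number> = {};
  if (all.length === 0) return scores;
  const cheapestCostUsd = Math.min(...all.map((s) => s.costUsd));
  for (const s of all) {
    const speedFactor = 1 / s.durationSeconds;          // inverse of duration
    const costNormalized = s.costUsd / cheapestCostUsd; // 1 = cheapest provider
    scores[s.provider] = (s.successRate * speedFactor) / costNormalized;
  }
  return scores;
}
```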
Display & Filtering
The Performance → Provider Benchmarks panel shows:
- Time window - 7d, 30d, 90d
- Work-type filter - Optional; compare costs/speed across research, dev, qa, acceptance (some work-types may prefer certain providers)
- Sort order - By count (default), cost, duration, success rate, or efficiency score
Key visualizations:
- Provider comparison table - Rows for each provider; columns for success %, cost, duration, runs
- Cost vs. success scatter plot - Each provider is a dot; dots further right have higher success, dots lower down cost less
- Duration histogram - Distribution of response times per provider
- Total runs bar chart - Usage breakdown (pie or stacked bar)
- Cost-efficiency ranking - Ranked list with score and trend
Data Sources
Provider traces:
session_activities table; each LLM call is a span with type='action', action.type='llm.invoke'
- duration_ms: start → end timestamp
- provider: which LLM provider answered
Cost:
cost_events table; one row per session with cost breakdown by provider
- provider_cost_breakdown: { [provider]: raw_cost_usd }
- Token counts per provider if available
Status:
session_activities.error - non-null if call failed (rate limit, timeout, invalid response)
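For illustration, per-session cost rows could be rolled up into per-provider totals along these lines. The `CostEventRow` type is a hypothetical reduction of a cost_events row to the one field described above; the real row likely carries more columns.

```typescript
// Hypothetical row shape: only provider_cost_breakdown is modeled.
interface CostEventRow {
  provider_cost_breakdown: Record<string, number>; // { [provider]: raw_cost_usd }
}

// Sums raw_cost_usd per provider across all session rows.
function totalCostByProvider(rows: CostEventRow[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    for (const provider of Object.keys(row.provider_cost_breakdown)) {
      const cost = row.provider_cost_breakdown[provider] ?? 0;
      totals[provider] = (totals[provider] ?? 0) + cost;
    }
  }
  return totals;
}
```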
Interpreting Results
One Provider Dominates (>80% of runs)
- ✓ Consistent provider if performance is good
- ⚠ Vendor lock-in risk; single point of failure
Action: Consider multi-provider failover (see routing intelligence). Keep 10-20% of traffic on a backup provider as insurance.
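A quick way to check for this condition outside the dashboard is sketched below, assuming you already have run counts per provider. The function name and the input shape are hypothetical; the 0.8 default mirrors the >80% threshold above.

```typescript
// Returns the provider whose share of runs exceeds `threshold`, or null if
// no single provider dominates (or there are no runs at all).
function dominantProvider(
  runsByProvider: Record<string, number>,
  threshold = 0.8,
): string | null {
  const providers = Object.keys(runsByProvider);
  let total = 0;
  for (const p of providers) total += runsByProvider[p] ?? 0;
  if (total === 0) return null;
  for (const p of providers) {
    if ((runsByProvider[p] ?? 0) / total > threshold) return p;
  }
  return null;
}
```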
Provider Success Rates Differ >5%
- ⚠ Something is wrong; investigate the low-success provider
Investigation:
# Fetch errors from low-success provider
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
'https://api.rensei.ai/api/factory/events?provider=openai&type=error&after=2026-05-03'
# Group by error type
# → If "rate limit exceeded" → Upgrade API quota
# → If "authentication failed" → Rotate API key
# → If "timeout" → Network issue or provider overload
Cost Differs Significantly (e.g., Haiku $0.02, GPT-4 $0.20)
- ✓ Expected; different models have different pricing
- ⚠ But if cost is 10x and success is similar, you're overpaying
Action:
- Swap expensive model for cheaper one (e.g., GPT-4 Turbo → GPT-4o)
- Use smaller model for routine tasks (e.g., Haiku for classification)
- See Model Routing for cost-aware routing config
Latency Differs Significantly (e.g., Anthropic 1s, OpenAI 4s)
- ⚠ OpenAI may be overloaded or geographically distant
Action:
- Check OpenAI status page for incidents
- Consider geographic routing (route to closest provider)
- See Model Routing for latency-aware routing
Cost-Efficiency Score Reveals a Winner
- ✓ High score → Optimal provider for this work-type
- Recommendation: Use as default; reserve other providers for fallback
Example: Haiku efficiency 85, GPT-4o efficiency 60, GPT-4 efficiency 25 → Default to Haiku; failover to GPT-4o; use GPT-4 only for complex reasoning (if needed)
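The default-plus-fallback choice in the example can be expressed as a simple ranking. This is a sketch only; the `ScoredProvider` type and function name are hypothetical, and real routing is configured in Model Routing rather than in code like this.

```typescript
// Hypothetical input: a provider (or model) name with its efficiency score.
interface ScoredProvider {
  provider: string;
  costEfficiencyScore: number; // 0-100, higher is better
}

// Highest score becomes the primary; the rest are fallbacks in score order.
function rankByEfficiency(
  list: ScoredProvider[],
): { primary: string | null; fallbacks: string[] } {
  const ranked = list
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.costEfficiencyScore - a.costEfficiencyScore);
  const names = ranked.map((p) => p.provider);
  return { primary: names[0] ?? null, fallbacks: names.slice(1) };
}
```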
Upward Cost Trend (Week-over-week)
- ⚠ Likely cause: Migration to more expensive model
Verification:
- Check provider mix in "Total Runs" chart; did % allocation shift?
- Check token counts; did context window grow?
- Check model selection in workflow definition (did you upgrade?)
Action: Decide if upgrade is intentional (higher quality) or accidental (revert).
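When verifying a trend by hand, the week-over-week change is a plain percentage delta. A minimal sketch, with a hypothetical function name:

```typescript
// Percentage change from the previous period to the current one.
// Returns null when the previous value is 0, since the change is undefined.
function percentChange(previous: number, current: number): number | null {
  if (previous === 0) return null;
  return ((current - previous) / previous) * 100;
}
```

For instance, an average cost that moves from $0.10 to $0.15 per task is a change of about +50%.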
Optimization Playbook
Goal: Reduce Cost
- Identify most-used expensive provider (e.g., GPT-4)
- Test swapping to cheaper model (e.g., Haiku, Sonnet)
- Check that the success rate doesn't drop by more than 2%
- If success ≥ threshold → Deploy swap in your model-routing config
- Measure savings (usually 50-80% cost reduction)
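The swap decision in the steps above can be sketched as a small check. The types and names here are hypothetical, and the 2% limit is interpreted as 2 percentage points of success rate, which is an assumption about the intended meaning.

```typescript
// Hypothetical summary of one model's benchmark results.
interface ModelStats {
  successRate: number; // 0-100
  avgCostUsd: number;  // mean cost per task
}

// Compares a cheaper candidate against the current model.
function evaluateSwap(
  current: ModelStats,
  candidate: ModelStats,
  maxSuccessDropPts = 2, // assumed to mean percentage points
): { acceptable: boolean; savingsPct: number } {
  const successDrop = current.successRate - candidate.successRate;
  const savingsPct =
    ((current.avgCostUsd - candidate.avgCostUsd) / current.avgCostUsd) * 100;
  return { acceptable: successDrop <= maxSuccessDropPts, savingsPct };
}
```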
Goal: Improve Speed
- Identify slowest provider
- Check if it's actually slow (latency) or just heavily used (queued)
- If network latency → Use geographic routing
- If queue → Upgrade provider's API tier or use failover
- If the model itself is inherently slow (long reasoning) → Accept the latency
Goal: Improve Reliability
- Find provider with <98% success rate
- Identify error type (rate limit, auth, timeout)
- Fix root cause (upgrade quota, rotate key, improve network)
- Alternatively, implement failover so errors don't block (see Routing Intelligence)
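To triage a batch of errors, the error-type-to-fix mapping can be sketched as below. The substring matching is an assumption about what error messages look like; actual messages from the events API may be structured differently.

```typescript
type Remedy =
  | "upgrade API quota"
  | "rotate API key"
  | "check network or provider load"
  | "inspect manually";

// Maps an error message to a suggested fix using simple substring checks.
function remedyFor(errorMessage: string): Remedy {
  const msg = errorMessage.toLowerCase();
  if (msg.indexOf("rate limit") !== -1) return "upgrade API quota";
  if (msg.indexOf("authentication") !== -1) return "rotate API key";
  if (msg.indexOf("timeout") !== -1) return "check network or provider load";
  return "inspect manually";
}

// Counts how many errors fall under each suggested fix.
function countByRemedy(errors: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const error of errors) {
    const remedy = remedyFor(error);
    counts[remedy] = (counts[remedy] ?? 0) + 1;
  }
  return counts;
}
```

The remedy with the highest count is usually the root cause to address first.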
Goal: Multi-Provider Resilience
- Keep primary provider with high cost-efficiency score
- Add secondary provider with good success rate (even if slower/more expensive)
- Configure failover in Model Routing
- Route a small share of traffic (e.g., 10%) to the secondary provider to catch degradation early
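The traffic split in the last step amounts to a weighted coin flip per task. The sketch below is illustrative only, with a hypothetical function name; in practice the split is set in the Model Routing config rather than in application code.

```typescript
// Picks the secondary provider for roughly `secondaryShare` of calls.
// `roll` is a number in [0, 1), for example the result of Math.random().
function chooseProvider(
  primary: string,
  secondary: string,
  roll: number,
  secondaryShare = 0.1,
): string {
  return roll < secondaryShare ? secondary : primary;
}
```

Passing the roll in as an argument, rather than calling Math.random() inside, keeps the function deterministic and easy to test.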
API & Programmatic Access
Component shape (TypeScript):
interface ProviderBenchmark {
  provider: string // "anthropic", "openai", "gemini", etc.
  successRate: number // 0-100
  avgCostUsd: number
  avgDurationMs: number
  totalRuns: number
  costEfficiencyScore: number // 0-100
  trend7d: {
    costDelta: number // % change from 7 days ago
    successDelta: number
    durationDelta: number
  }
}
The metrics are aggregated server-side. No standalone API endpoint exists; data is part of the performance dashboard query.
To fetch raw provider metrics:
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
'https://api.rensei.ai/api/factory/metrics?metric=provider&workspaceId=ws_123&after=2026-05-03T00:00:00Z&before=2026-06-03T00:00:00Z'
Returns array of provider benchmark objects.
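If you call the endpoint from code rather than curl, the query string can be assembled as in this sketch. The `MetricsQuery` type and function name are hypothetical; the base URL and parameter names are taken from the curl example. Colons in the timestamps are percent-encoded here, which is equivalent to the unencoded form in the curl command once the server decodes the query.

```typescript
// Hypothetical parameter bundle mirroring the query string in the curl example.
interface MetricsQuery {
  workspaceId: string;
  after: string;  // ISO 8601 timestamp
  before: string; // ISO 8601 timestamp
}

// Builds the provider-metrics URL with properly encoded parameters.
function buildProviderMetricsUrl(query: MetricsQuery): string {
  const base = "https://api.rensei.ai/api/factory/metrics";
  const params: Array<[string, string]> = [
    ["metric", "provider"],
    ["workspaceId", query.workspaceId],
    ["after", query.after],
    ["before", query.before],
  ];
  const qs = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `${base}?${qs}`;
}
```

The resulting URL is sent with the same Authorization: Bearer header as in the curl example.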
Related Pages
- Routing Intelligence - Thompson Sampling provider selection
- Model Routing - Configure provider pools and cost routing
- Agent Reliability - Session success rates (aggregate view)
- Phase Breakdown - Cost by phase (may vary by provider)