Provider Benchmarks

Compare LLM provider performance across cost, speed, and reliability. Provider Benchmarks shows which providers succeed most often, cost the least, and respond the fastest - helping you optimize your provider routing strategy.

Overview

When a workflow runs, it routes each agent task to an LLM provider (Anthropic, OpenAI, etc.). The Provider Benchmarks panel tracks:

Provider name - Anthropic, OpenAI, Google Gemini, etc.
Success rate - % of tasks that completed without error
Avg cost - Mean LLM cost (tokens × model pricing) per task
Avg duration - Mean wall-clock time from request to response
Total runs - How many times this provider was used
Cost-efficiency score - Composite metric: (success_rate × speed) / cost

Key Metrics

Success Rate

What % of tasks completed successfully with this provider?

Data source: cost_events and session_activities (LLM-call spans)
Calculation: Successful calls / total calls. Timeouts, 5xx errors, and invalid responses count as failures.
Typical range: 95-99.5% (modern LLM providers are very reliable)
Red flag: <95% → Integration issue (rate limiting, credentials, malformed prompts)
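
The calculation above can be sketched as follows. This is a minimal illustration, assuming each LLM-call span carries a `provider` and a nullable `error` field as described under Data Sources; the `LlmSpan` type and `successRateByProvider` helper are hypothetical names, not part of the product API.

```typescript
// Hypothetical span shape: only the fields this calculation needs.
interface LlmSpan {
  provider: string;
  error: string | null; // non-null when the call failed
}

// Success rate per provider as a percentage (0-100).
function successRateByProvider(spans: LlmSpan[]): Record<string, number> {
  const ok: Record<string, number> = {};
  const all: Record<string, number> = {};
  for (const span of spans) {
    all[span.provider] = (all[span.provider] ?? 0) + 1;
    if (span.error === null) {
      ok[span.provider] = (ok[span.provider] ?? 0) + 1;
    }
  }
  const rates: Record<string, number> = {};
  for (const provider of Object.keys(all)) {
    rates[provider] = ((ok[provider] ?? 0) * 100) / all[provider];
  }
  return rates;
}
```

A provider whose value falls below 95 would match the red-flag condition above.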

Average Cost Per Task

Mean USD cost per LLM invocation.

Data source: cost_events (raw_cost_usd and token_count)
Composition: Input tokens × input_rate + output tokens × output_rate
Typical cost per task:
- Anthropic Claude 3 Haiku: $0.01-0.05 (fast, cheap)
- Anthropic Claude 3 Sonnet: $0.05-0.15 (balanced)
- OpenAI GPT-4o: $0.03-0.10 (cheaper GPT-4-class model)
- OpenAI GPT-4 Turbo: $0.10-0.30 (expensive, powerful)

High cost drivers:

Large context windows (long system prompts, memory)
Output-heavy tasks (code generation, long reasoning)
Expensive models (GPT-4 > Sonnet > Haiku)
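
To make the composition rule concrete (input tokens × input_rate + output tokens × output_rate), here is a small sketch. The `ModelPricing` type, both function names, and the per-million-token rate unit are assumptions for illustration, and the rates in the usage note are placeholders rather than any provider's real pricing.

```typescript
// Hypothetical pricing shape: USD per one million tokens.
interface ModelPricing {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

// Cost of one call: input tokens x input rate + output tokens x output rate.
function callCostUsd(inputTokens: number, outputTokens: number, pricing: ModelPricing): number {
  return (
    (inputTokens / 1e6) * pricing.inputUsdPerMTok +
    (outputTokens / 1e6) * pricing.outputUsdPerMTok
  );
}

// Mean cost across a set of calls; returns 0 for an empty set.
function averageCostUsd(costs: number[]): number {
  if (costs.length === 0) return 0;
  let sum = 0;
  for (const cost of costs) sum += cost;
  return sum / costs.length;
}
```

With placeholder rates of $1 per million input tokens and $5 per million output tokens, a call with 20,000 input tokens and 2,000 output tokens costs about $0.03. Because the output rate is higher in this example, output-heavy tasks raise the average faster than long prompts do.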

Average Duration

Wall-clock time from request submission to response receipt.

Data source: session_activities span timing (request_at → response_at)
Typical latencies:
- Anthropic: 0.5-2.0s (fast, streaming)
- OpenAI: 1.0-3.0s (variable, depends on load)
- Google Gemini: 1.0-4.0s (slower)
- Local/On-prem: 2.0-10.0s (depends on hardware)

High latency drivers:

Provider overload (queuing)
Large outputs (long generation)
Network distance (geographic latency)
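
A minimal sketch of the duration calculation, assuming `request_at` and `response_at` are ISO 8601 timestamp strings on each span. The `SpanTiming` type and the helper names are hypothetical.

```typescript
// Hypothetical span timing fields, as ISO 8601 timestamps.
interface SpanTiming {
  request_at: string;
  response_at: string;
}

// Wall-clock duration of one call in milliseconds.
function durationMs(span: SpanTiming): number {
  return Date.parse(span.response_at) - Date.parse(span.request_at);
}

// Mean duration across calls; returns 0 when there are no calls.
function averageDurationMs(spans: SpanTiming[]): number {
  if (spans.length === 0) return 0;
  let total = 0;
  for (const span of spans) total += durationMs(span);
  return total / spans.length;
}
```

A mean hides outliers, which is one reason the panel also shows a duration histogram per provider.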

Total Runs

How many times was this provider invoked in the time window?

Trend: Week-over-week change; used to detect migration (e.g., "switching from OpenAI to Anthropic")
Usage pattern: Concentrated on few providers, or distributed?
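
Both questions above reduce to simple arithmetic on run counts. The sketch below assumes you already have per-provider run counts for the current and previous week; the function names are hypothetical.

```typescript
// Percentage change in run count from the previous week to the current one.
// Returns null when there is no baseline to compare against.
function weekOverWeekChange(previousRuns: number, currentRuns: number): number | null {
  if (previousRuns === 0) return null;
  return ((currentRuns - previousRuns) * 100) / previousRuns;
}

// Share of total runs handled by each provider, as a percentage.
function usageShare(runsByProvider: Record<string, number>): Record<string, number> {
  let total = 0;
  for (const provider of Object.keys(runsByProvider)) total += runsByProvider[provider];
  const shares: Record<string, number> = {};
  for (const provider of Object.keys(runsByProvider)) {
    shares[provider] = total === 0 ? 0 : (runsByProvider[provider] * 100) / total;
  }
  return shares;
}
```

A falling week-over-week figure for one provider paired with a rising one for another suggests a migration. A single share above 80 indicates usage is concentrated on one provider.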

Cost-Efficiency Score

Composite metric combining success, speed, and cost.

Formula: (success_rate × speed_factor) / cost_normalized
- success_rate: 0-1 (higher is better)
- speed_factor: inverse of duration (higher is better, e.g., 1 / duration_seconds)
- cost_normalized: cost relative to cheapest provider (1 = cheapest, 2 = 2x cheapest)
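Range: 0-100 (higher is better). The raw ratio from the formula is rescaled onto this range; the scaling is not documented here, so treat the example scores as illustrative.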
Example:
- Anthropic Haiku: 99% success, 1s latency, $0.02/task → score = 85
- OpenAI GPT-4: 99% success, 2s latency, $0.20/task → score = 25

Use this to find the "sweet spot" provider for your workflow.
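
The formula can be sketched as below. This computes only the raw ratio (success_rate × speed_factor) / cost_normalized; how that ratio maps onto the 0-100 score shown in the panel is not stated in this page, so no rescaling is attempted. The `ProviderStats` type and `rawEfficiency` function are hypothetical names.

```typescript
// Hypothetical input shape for the raw ratio.
interface ProviderStats {
  provider: string;
  successRate: number;        // fraction, 0-1
  avgDurationSeconds: number;
  avgCostUsd: number;
}

// Raw ratio per provider: (success_rate x speed_factor) / cost_normalized.
function rawEfficiency(stats: ProviderStats[]): Record<string, number> {
  // The cheapest provider defines cost_normalized = 1.
  let cheapest = Infinity;
  for (const s of stats) cheapest = Math.min(cheapest, s.avgCostUsd);

  const result: Record<string, number> = {};
  for (const s of stats) {
    const speedFactor = 1 / s.avgDurationSeconds;
    const costNormalized = s.avgCostUsd / cheapest;
    result[s.provider] = (s.successRate * speedFactor) / costNormalized;
  }
  return result;
}
```

For the two example providers (99% success, 1s, $0.02 versus 99% success, 2s, $0.20), the raw ratios come out at about 0.99 and 0.0495. The ordering matches the example scores even though the magnitudes differ.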

Display & Filtering

The Performance → Provider Benchmarks panel shows:

Time window - 7d, 30d, 90d
Work-type filter - Optional; compare costs/speed across research, dev, qa, acceptance (some work-types may prefer certain providers)
Sort order - By count (default), cost, duration, success rate, or efficiency score

Key visualizations:

Provider comparison table - Rows for each provider; columns for success %, cost, duration, runs
Cost vs. success scatter plot - Each provider is a dot; dots further right have higher success rates and dots lower down cost less, so the best providers sit bottom-right
Duration histogram - Distribution of response times per provider
Total runs chart - Usage breakdown by provider (pie or stacked bar)
Cost-efficiency ranking - Ranked list with score and trend

Data Sources

Provider traces:

session_activities table; each LLM call is a span with type='action', action.type='llm.invoke'
duration_ms: start → end timestamp
provider: which LLM provider answered

Cost:

cost_events table; one row per session with cost breakdown by provider
provider_cost_breakdown: { [provider]: raw_cost_usd }
Token counts per provider if available

Status:

session_activities.error - non-null if call failed (rate limit, timeout, invalid response)
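
Since cost_events holds one row per session, per-provider totals require summing `provider_cost_breakdown` across rows. A minimal sketch, assuming rows have already been loaded into memory; the `CostEventRow` type and `totalCostByProvider` helper are hypothetical.

```typescript
// Hypothetical row shape for cost_events: one row per session.
interface CostEventRow {
  provider_cost_breakdown: Record<string, number>; // provider -> raw_cost_usd
}

// Sum raw_cost_usd per provider across all session rows.
function totalCostByProvider(rows: CostEventRow[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    for (const provider of Object.keys(row.provider_cost_breakdown)) {
      totals[provider] = (totals[provider] ?? 0) + row.provider_cost_breakdown[provider];
    }
  }
  return totals;
}
```

Dividing each total by that provider's count of LLM-call spans from session_activities gives the average cost per task.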

Interpreting Results

One Provider Dominates (>80% of runs)

✓ Fine if that provider performs well; behavior stays consistent across tasks
⚠ Vendor lock-in risk; single point of failure

Action: Consider multi-provider failover (see Routing Intelligence). Keep 10-20% of traffic on a backup provider as insurance.

Provider Success Rates Differ >5%

⚠ Something is wrong; investigate the low-success provider

Investigation:

# Fetch errors from low-success provider
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
  'https://api.rensei.ai/api/factory/events?provider=openai&type=error&after=2026-05-03'

# Group by error type
# → If "rate limit exceeded" → Upgrade API quota
# → If "authentication failed" → Rotate API key
# → If "timeout" → Network issue or provider overload
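
The grouping step in the comments above can be sketched in TypeScript. The response shape of the events endpoint is not documented here, so this sketch assumes you have already extracted the error messages as plain strings; the matching patterns, the `ErrorCategory` type, and the function names are illustrative assumptions.

```typescript
type ErrorCategory = "rate_limit" | "auth" | "timeout" | "other";

// Suggested follow-up per category, taken from the comments above.
const suggestedAction: Record<ErrorCategory, string> = {
  rate_limit: "Upgrade API quota",
  auth: "Rotate API key",
  timeout: "Check for a network issue or provider overload",
  other: "Inspect the raw error",
};

// Classify one error message by simple pattern matching (assumed wording).
function categorizeError(message: string): ErrorCategory {
  const text = message.toLowerCase();
  if (/rate limit/.test(text)) return "rate_limit";
  if (/authentication failed/.test(text)) return "auth";
  if (/timeout|timed out/.test(text)) return "timeout";
  return "other";
}

// Count messages per category to see which root cause dominates.
function countByCategory(messages: string[]): Record<ErrorCategory, number> {
  const counts: Record<ErrorCategory, number> = { rate_limit: 0, auth: 0, timeout: 0, other: 0 };
  for (const message of messages) counts[categorizeError(message)] += 1;
  return counts;
}
```

The category with the highest count points to the fix to try first.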

Cost Differs Significantly (e.g., Haiku $0.02, GPT-4 $0.20)

✓ Expected; different models have different pricing
⚠ But if cost is 10x and success is similar, you're overpaying

Action:

Swap expensive model for cheaper one (e.g., GPT-4 Turbo → GPT-4o)
Use smaller model for routine tasks (e.g., Haiku for classification)
See Model Routing for cost-aware routing config
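
The "10x cost, similar success" check can be expressed as a small predicate. The threshold for "similar" is not defined in this page; the 1-percentage-point default below is an assumption, as are the type and function names.

```typescript
// Hypothetical per-provider summary used for the comparison.
interface ProviderCostProfile {
  provider: string;
  avgCostUsd: number;
  successRate: number; // percentage, 0-100
}

// True when the expensive provider costs at least `costMultiple` times more
// while its success rate is within `similarityPts` points of the cheap one.
function isOverpaying(
  expensive: ProviderCostProfile,
  cheap: ProviderCostProfile,
  costMultiple: number = 10,
  similarityPts: number = 1
): boolean {
  const costRatio = expensive.avgCostUsd / cheap.avgCostUsd;
  const successGap = Math.abs(expensive.successRate - cheap.successRate);
  return costRatio >= costMultiple && successGap <= similarityPts;
}
```

A `true` result is a prompt to test a swap, not proof that the cheaper model is good enough for every work-type.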

Latency Differs Significantly (e.g., Anthropic 1s, OpenAI 4s)

⚠ OpenAI may be overloaded or geographically distant

Action:

Check OpenAI status page for incidents
Consider geographic routing (route to closest provider)
See Model Routing for latency-aware routing

Cost-Efficiency Score Reveals a Winner

✓ High score → Optimal provider for this work-type
Recommendation: Use as default; reserve other providers for fallback

Example: Haiku efficiency 85, GPT-4o efficiency 60, GPT-4 efficiency 25 → Default to Haiku; failover to GPT-4o; use GPT-4 only for complex reasoning (if needed)
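
The recommendation above amounts to ranking providers by score and treating the top one as the default. A minimal sketch, with a hypothetical `planRouting` helper; it does not write any routing configuration, it only orders the candidates.

```typescript
// Hypothetical input: one entry per provider or model.
interface ScoredProvider {
  provider: string;
  costEfficiencyScore: number; // 0-100, higher is better
}

// Highest score becomes the primary; the rest are fallbacks in score order.
function planRouting(scored: ScoredProvider[]): { primary: string | null; fallbacks: string[] } {
  const ranked = scored.slice().sort((a, b) => b.costEfficiencyScore - a.costEfficiencyScore);
  const names = ranked.map((p) => p.provider);
  return {
    primary: names.length > 0 ? names[0] : null,
    fallbacks: names.slice(1),
  };
}
```

For the example scores (Haiku 85, GPT-4o 60, GPT-4 25), the primary is Haiku and the fallbacks are GPT-4o followed by GPT-4.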

Upward Cost Trend (Week-over-week)

⚠ Likely cause: Migration to more expensive model

Verification:

Check provider mix in "Total Runs" chart; did % allocation shift?
Check token counts; did context window grow?
Check model selection in workflow definition (did you upgrade?)

Action: Decide if upgrade is intentional (higher quality) or accidental (revert).

Optimization Playbook

Goal: Reduce Cost

Identify most-used expensive provider (e.g., GPT-4)
Test swapping to cheaper model (e.g., Haiku, Sonnet)
Check that the success rate doesn't drop by more than 2 percentage points
If success ≥ threshold → Deploy swap in your model-routing config
Measure savings (usually 50-80% cost reduction)
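
The swap decision in the steps above can be sketched as a single check. The 2-point default comes from the threshold in the playbook; the `SwapCandidate` type and `evaluateSwap` function are hypothetical names.

```typescript
// Hypothetical benchmark summary for the current and candidate models.
interface SwapCandidate {
  successRate: number; // percentage, 0-100
  avgCostUsd: number;
}

// Approve the swap when success drops by no more than `maxSuccessDropPts`
// percentage points, and report the cost reduction as a percentage.
function evaluateSwap(
  current: SwapCandidate,
  candidate: SwapCandidate,
  maxSuccessDropPts: number = 2
): { approved: boolean; savingsPct: number } {
  const successDrop = current.successRate - candidate.successRate;
  const savingsPct = ((current.avgCostUsd - candidate.avgCostUsd) / current.avgCostUsd) * 100;
  return { approved: successDrop <= maxSuccessDropPts, savingsPct };
}
```

Moving from a $0.20 task at 99% success to a $0.05 task at 98% success is approved with roughly 75% savings. A drop to 95% success would be rejected.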

Goal: Improve Speed

Identify slowest provider
Check if it's actually slow (latency) or just heavily used (queued)
If network latency → Use geographic routing
If queue → Upgrade provider's API tier or use failover
If the model is inherently slow (long reasoning) → Accept the latency

Goal: Improve Reliability

Find provider with <98% success rate
Identify error type (rate limit, auth, timeout)
Fix root cause (upgrade quota, rotate key, improve network)
Alternatively, implement failover so errors don't block (see Routing Intelligence)

Goal: Multi-Provider Resilience

Keep primary provider with high cost-efficiency score
Add secondary provider with good success rate (even if slower/more expensive)
Configure failover in Model Routing
Route a steady share of traffic (e.g., 10%) to the secondary to catch degradation early
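
Keeping a fixed share of traffic on the secondary can be illustrated with a weighted random pick. This is only a sketch of the idea; actual failover and traffic splitting are configured in Model Routing, and the `pickProvider` function here is hypothetical.

```typescript
// Send roughly `secondaryShare` (0-1) of requests to the secondary provider.
// The random source is injectable so the behavior can be tested deterministically.
function pickProvider(
  primary: string,
  secondary: string,
  secondaryShare: number,
  random: () => number = Math.random
): string {
  return random() < secondaryShare ? secondary : primary;
}
```

Calling `pickProvider("anthropic", "openai", 0.1)` once per task sends about one task in ten to the secondary, enough to keep its benchmark figures current.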

API & Programmatic Access

Component shape (TypeScript):

interface ProviderBenchmark {
  provider: string           // "anthropic", "openai", "gemini", etc.
  successRate: number        // percentage, 0-100 (the score formula uses the 0-1 fraction)
  avgCostUsd: number
  avgDurationMs: number
  totalRuns: number
  costEfficiencyScore: number  // 0-100
  trend7d: {
    costDelta: number        // % change from 7 days ago
    successDelta: number
    durationDelta: number
  }
}

The metrics are aggregated server-side. There is no dedicated Provider Benchmarks endpoint; the data is served as part of the performance dashboard query.

To fetch raw provider metrics, query the general metrics endpoint with metric=provider:

curl -H "Authorization: Bearer $RENSEI_API_KEY" \
  'https://api.rensei.ai/api/factory/metrics?metric=provider&workspaceId=ws_123&after=2026-05-03T00:00:00Z&before=2026-06-03T00:00:00Z'

Returns an array of provider benchmark objects.
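
The same request URL can be assembled in TypeScript. The sketch below reuses only the query parameters visible in the curl example; the `MetricsQuery` type and `buildProviderMetricsUrl` function are hypothetical names.

```typescript
// Parameters taken from the curl example above.
interface MetricsQuery {
  workspaceId: string;
  after: string;  // ISO 8601 timestamp
  before: string; // ISO 8601 timestamp
}

// Build the metrics URL with each query value percent-encoded.
function buildProviderMetricsUrl(query: MetricsQuery): string {
  const params: Array<[string, string]> = [
    ["metric", "provider"],
    ["workspaceId", query.workspaceId],
    ["after", query.after],
    ["before", query.before],
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://api.rensei.ai/api/factory/metrics?${queryString}`;
}
```

Pass the resulting URL to your HTTP client with the same `Authorization: Bearer` header used in the curl example. Encoding the timestamps keeps the colons from being misread by intermediaries.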

Related

Routing Intelligence - Thompson Sampling provider selection
Model Routing - Configure provider pools and cost routing
Agent Reliability - Session success rates (aggregate view)
Phase Breakdown - Cost by phase (may vary by provider)
