
Provider Benchmarks

Provider success/cost/duration benchmarks.

Compare LLM provider performance across cost, speed, and reliability. Provider Benchmarks shows which providers succeed most often, cost the least, and respond the fastest - helping you optimize your provider routing strategy.

Overview

When a workflow runs, it routes each agent task to an LLM provider (Anthropic, OpenAI, etc.). The Provider Benchmarks panel tracks:

  • Provider name - Anthropic, OpenAI, Google Gemini, etc.
  • Success rate - % of tasks that completed without error
  • Avg cost - Mean LLM cost (tokens × model pricing) per task
  • Avg duration - Mean wall-clock time from request to response
  • Total runs - How many times this provider was used
  • Cost-efficiency score - Composite metric: (success_rate × speed_factor) / cost_normalized (terms defined under Key Metrics)

Key Metrics

Success Rate

What % of tasks completed successfully with this provider?

  • Data source: cost_events and session_activities (LLM-call spans)
  • Calculation: Successful calls / total calls; timeouts, 5xx errors, and invalid responses do not count as successful
  • Typical range: 95-99.5% (modern LLM providers are very reliable)
  • Red flag: <95% → Integration issue (rate limiting, credentials, malformed prompts)
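
The calculation above can be sketched in a few lines of TypeScript. This is a minimal illustration, assuming each LLM call has already been reduced to an outcome label; the `CallOutcome` type and `successRate` function are hypothetical names, not part of the Rensei API.

```typescript
// Hypothetical outcome labels for one LLM call; not a Rensei type.
type CallOutcome = "ok" | "timeout" | "server_error" | "invalid_response";

// Success rate as a percentage: successful calls / total calls.
// Timeouts, 5xx errors, and invalid responses are treated as failures here.
function successRate(outcomes: CallOutcome[]): number {
  if (outcomes.length === 0) return NaN; // no calls in the window
  const ok = outcomes.filter((o) => o === "ok").length;
  return (ok * 100) / outcomes.length;
}

const sample: CallOutcome[] = [
  "ok", "ok", "timeout", "ok", "server_error", "ok", "ok", "ok", "ok", "ok",
];
const rate = successRate(sample); // 8 of 10 calls succeeded
const redFlag = rate < 95;        // compare against the 95% red-flag line
```

Returning NaN for an empty window keeps "no data" distinct from "0% success", which would otherwise look like a total outage.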

Average Cost Per Task

Mean USD cost per LLM invocation.

  • Data source: cost_events (raw_cost_usd and token_count)
  • Composition: Input tokens × input_rate + output tokens × output_rate
  • Typical cost per task:
    • Anthropic Claude 3 Haiku: $0.01-0.05 (fast, cheap)
    • Anthropic Claude 3 Sonnet: $0.05-0.15 (balanced)
    • OpenAI GPT-4: $0.10-0.30 (expensive, powerful)
    • OpenAI GPT-4 Turbo: $0.03-0.10 (cheaper GPT-4)

High cost drivers:

  • Large context windows (long system prompts, memory)
  • Output-heavy tasks (code generation, long reasoning)
  • Expensive models (GPT-4 > Sonnet > Haiku)
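
As a rough illustration of the composition rule (input tokens × input_rate + output tokens × output_rate), the sketch below computes per-call and mean cost. The `ModelRates` shape and the rate values are placeholders chosen for the example, not real provider pricing.

```typescript
// Per-token rates in USD. The values used below are placeholders, not real pricing.
interface ModelRates {
  inputUsdPerToken: number;
  outputUsdPerToken: number;
}

// cost = input tokens × input rate + output tokens × output rate
function callCostUsd(inputTokens: number, outputTokens: number, rates: ModelRates): number {
  return inputTokens * rates.inputUsdPerToken + outputTokens * rates.outputUsdPerToken;
}

// Mean cost per task over a set of calls.
function avgCostUsd(costs: number[]): number {
  if (costs.length === 0) return NaN;
  return costs.reduce((sum, c) => sum + c, 0) / costs.length;
}

const placeholderRates: ModelRates = { inputUsdPerToken: 0.000001, outputUsdPerToken: 0.000005 };
const longPrompt = callCostUsd(20000, 500, placeholderRates); // context-heavy call
const longOutput = callCostUsd(2000, 4000, placeholderRates); // output-heavy call
const meanCost = avgCostUsd([longPrompt, longOutput]);
```

With these placeholder rates, a long prompt and a long completion cost about the same, which is why both appear in the list of cost drivers.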

Average Duration

Wall-clock time from request submission to response receipt.

  • Data source: session_activities span timing (request_at → response_at)
  • Typical latencies:
    • Anthropic: 0.5-2.0s (fast, streaming)
    • OpenAI: 1.0-3.0s (variable, depends on load)
    • Google Gemini: 1.0-4.0s (slower)
    • Local/On-prem: 2.0-10.0s (depends on hardware)

High latency drivers:

  • Provider overload (queuing)
  • Large outputs (long generation)
  • Network distance (geographic latency)
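
A small sketch of the duration calculation, assuming each span exposes its request and response times as ISO-8601 strings. The `SpanTiming` shape and its camel-cased field names are illustrative; the actual column names are described under Data Sources.

```typescript
// Hypothetical span shape: ISO-8601 timestamps for request and response.
interface SpanTiming {
  requestAt: string;
  responseAt: string;
}

// Wall-clock duration of one span in milliseconds.
function durationMs(span: SpanTiming): number {
  return Date.parse(span.responseAt) - Date.parse(span.requestAt);
}

// Mean duration across spans, in milliseconds.
function avgDurationMs(spans: SpanTiming[]): number {
  if (spans.length === 0) return NaN;
  const total = spans.reduce((sum, s) => sum + durationMs(s), 0);
  return total / spans.length;
}

const spans: SpanTiming[] = [
  { requestAt: "2026-05-03T10:00:00.000Z", responseAt: "2026-05-03T10:00:01.200Z" },
  { requestAt: "2026-05-03T10:05:00.000Z", responseAt: "2026-05-03T10:05:00.800Z" },
];
const meanMs = avgDurationMs(spans); // (1200 + 800) / 2
```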

Total Runs

How many times was this provider invoked in the time window?

  • Trend: Week-over-week change; used to detect migration (e.g., "switching from OpenAI to Anthropic")
  • Usage pattern: Concentrated on few providers, or distributed?
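
To make the trend and usage-pattern checks concrete, here is a minimal sketch that computes each provider's share of runs and its week-over-week change. The `RunCounts` type, the function names, and the sample counts are made up for illustration.

```typescript
// Runs per provider for one week, keyed by provider name.
type RunCounts = Record<string, number>;

// Each provider's share of total runs, as a fraction between 0 and 1.
function runShares(counts: RunCounts): Record<string, number> {
  const providers = Object.keys(counts);
  const total = providers.reduce((sum, p) => sum + counts[p], 0);
  const shares: Record<string, number> = {};
  for (const p of providers) {
    shares[p] = total === 0 ? 0 : counts[p] / total;
  }
  return shares;
}

// Week-over-week change in runs, in percent; null when there is no baseline.
function weekOverWeekPct(previous: number, current: number): number | null {
  if (previous === 0) return null;
  return ((current - previous) / previous) * 100;
}

const lastWeek: RunCounts = { openai: 800, anthropic: 200 };
const thisWeek: RunCounts = { openai: 400, anthropic: 600 };
const shares = runShares(thisWeek);
const openaiChange = weekOverWeekPct(lastWeek["openai"], thisWeek["openai"]);
const anthropicChange = weekOverWeekPct(lastWeek["anthropic"], thisWeek["anthropic"]);
```

In this sample, OpenAI runs halve while Anthropic runs triple, which is the kind of pattern the trend column is meant to surface as a migration.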

Cost-Efficiency Score

Composite metric combining success, speed, and cost.

  • Formula: (success_rate × speed_factor) / cost_normalized
    • success_rate: 0-1 (higher is better)
    • speed_factor: inverse of duration (higher is better, e.g., 1 / duration_seconds)
    • cost_normalized: cost relative to cheapest provider (1 = cheapest, 2 = 2x cheapest)
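  • Range: 0-100 (higher is better); the raw ratio is rescaled to this range, so the example scores below are not direct outputs of the formula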
  • Example:
    • Anthropic Haiku: 99% success, 1s latency, $0.02/task → score = 85
    • OpenAI GPT-4: 99% success, 2s latency, $0.20/task → score = 25

Use this to find the "sweet spot" provider for your workflow.
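
The sketch below evaluates the raw ratio from the formula for the two example providers. It is an illustration under assumptions: `ProviderStats` and `rawEfficiency` are hypothetical names, and the step that maps the raw ratio onto the 0-100 range is not specified on this page, so it is left out.

```typescript
interface ProviderStats {
  name: string;
  successRate: number;     // fraction, 0-1
  durationSeconds: number; // mean latency
  costUsd: number;         // mean cost per task
}

// Raw ratio: (success_rate × speed_factor) / cost_normalized, where
// speed_factor = 1 / duration_seconds and cost_normalized = cost / cheapest cost.
function rawEfficiency(p: ProviderStats, cheapestCostUsd: number): number {
  const speedFactor = 1 / p.durationSeconds;
  const costNormalized = p.costUsd / cheapestCostUsd;
  return (p.successRate * speedFactor) / costNormalized;
}

const providers: ProviderStats[] = [
  { name: "OpenAI GPT-4", successRate: 0.99, durationSeconds: 2, costUsd: 0.2 },
  { name: "Anthropic Haiku", successRate: 0.99, durationSeconds: 1, costUsd: 0.02 },
];
const cheapest = Math.min(...providers.map((p) => p.costUsd));

// Rank providers from most to least efficient by raw ratio.
const ranked = providers
  .map((p) => ({ name: p.name, raw: rawEfficiency(p, cheapest) }))
  .sort((a, b) => b.raw - a.raw);
```

The raw ratios come out near 0.99 for Haiku and 0.05 for GPT-4. The ordering matches the example scores, but the magnitudes do not, so treat the displayed 0-100 score as a ranking aid rather than something to recompute exactly.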

Display & Filtering

The Performance → Provider Benchmarks panel shows:

  • Time window - 7d, 30d, 90d
  • Work-type filter - Optional; compare costs/speed across research, dev, qa, acceptance (some work-types may prefer certain providers)
  • Sort order - By total runs (default), cost, duration, success rate, or efficiency score
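
If you reproduce the table outside the dashboard, the sort orders can be mimicked with a small comparator. The sort directions below are an assumption (lower is better for cost and duration, higher for everything else); the `Row` shape and the run counts are illustrative.

```typescript
type SortKey = "runs" | "cost" | "duration" | "successRate" | "efficiency";

interface Row {
  provider: string;
  runs: number;
  cost: number;        // avg USD per task
  duration: number;    // avg ms
  successRate: number; // 0-100
  efficiency: number;  // 0-100
}

// Assumed directions: ascending for cost and duration, descending for the rest.
const ascending: Record<SortKey, boolean> = {
  runs: false, cost: true, duration: true, successRate: false, efficiency: false,
};

function sortRows(rows: Row[], key: SortKey): Row[] {
  const sign = ascending[key] ? 1 : -1;
  return rows.slice().sort((a, b) => sign * (a[key] - b[key])); // copy, then sort
}

const rows: Row[] = [
  { provider: "anthropic", runs: 600, cost: 0.02, duration: 1000, successRate: 99, efficiency: 85 },
  { provider: "openai", runs: 900, cost: 0.2, duration: 2000, successRate: 99, efficiency: 25 },
];
const byRuns = sortRows(rows, "runs");
const byCost = sortRows(rows, "cost");
```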

Key visualizations:

  1. Provider comparison table - Rows for each provider; columns for success %, cost, duration, runs
  2. Cost vs. success scatter plot - Each provider is a dot, with success rate on the x-axis and cost on the y-axis; the best providers sit at the bottom right (high success, low cost)
  3. Duration histogram - Distribution of response times per provider
  4. Total runs chart - Usage breakdown by provider (pie or stacked bar)
  5. Cost-efficiency ranking - Ranked list with score and trend

Data Sources

Provider traces:

  • session_activities table; each LLM call is a span with type='action', action.type='llm.invoke'
  • duration_ms: start → end timestamp
  • provider: which LLM provider answered

Cost:

  • cost_events table; one row per session with cost breakdown by provider
  • provider_cost_breakdown: { [provider]: raw_cost_usd }
  • Token counts per provider if available

Status:

  • session_activities.error - non-null if call failed (rate limit, timeout, invalid response)
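
The three sources above can be combined into per-provider totals roughly as follows. This is a sketch, not the server-side aggregation: the row interfaces contain only the fields named in this section, and the `tool.call` action type in the sample data is invented to show a non-LLM span being skipped.

```typescript
// Assumed row shapes, reduced to the fields named above; real rows may differ.
interface ActivityRow {
  type: string;             // 'action' for spans of interest
  action: { type: string }; // 'llm.invoke' for LLM calls
  provider: string;
  duration_ms: number;
  error: string | null;     // non-null when the call failed
}
interface CostEventRow {
  provider_cost_breakdown: Record<string, number>; // provider -> raw_cost_usd
}
interface ProviderTotals {
  runs: number;
  failures: number;
  totalDurationMs: number;
  totalCostUsd: number;
}

function aggregate(activities: ActivityRow[], costs: CostEventRow[]): Record<string, ProviderTotals> {
  const out: Record<string, ProviderTotals> = {};
  const get = (provider: string): ProviderTotals => {
    let t = out[provider];
    if (t === undefined) {
      t = { runs: 0, failures: 0, totalDurationMs: 0, totalCostUsd: 0 };
      out[provider] = t;
    }
    return t;
  };
  for (const row of activities) {
    if (row.type !== "action" || row.action.type !== "llm.invoke") continue; // skip non-LLM spans
    const t = get(row.provider);
    t.runs += 1;
    t.totalDurationMs += row.duration_ms;
    if (row.error !== null) t.failures += 1;
  }
  for (const row of costs) {
    for (const provider of Object.keys(row.provider_cost_breakdown)) {
      get(provider).totalCostUsd += row.provider_cost_breakdown[provider];
    }
  }
  return out;
}

const totals = aggregate(
  [
    { type: "action", action: { type: "llm.invoke" }, provider: "anthropic", duration_ms: 900, error: null },
    { type: "action", action: { type: "llm.invoke" }, provider: "anthropic", duration_ms: 1100, error: "timeout" },
    { type: "action", action: { type: "tool.call" }, provider: "anthropic", duration_ms: 50, error: null }, // hypothetical non-LLM span
    { type: "action", action: { type: "llm.invoke" }, provider: "openai", duration_ms: 2000, error: null },
  ],
  [
    { provider_cost_breakdown: { anthropic: 0.04, openai: 0.2 } },
    { provider_cost_breakdown: { anthropic: 0.02 } },
  ],
);
```

Success rate, average duration, and average cost then follow by dividing each total by `runs`.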

Interpreting Results

One Provider Dominates (>80% of runs)

  • ✓ Consistent behavior across tasks, which is fine as long as performance is good
  • ⚠ Vendor lock-in risk; single point of failure

Action: Consider multi-provider failover (see Routing Intelligence). Keep 10-20% of traffic on a backup provider as insurance.

Provider Success Rates Differ >5%

  • ⚠ A gap of more than 5 percentage points usually means something is wrong with the integration; investigate the low-success provider

Investigation:

# Fetch errors from low-success provider
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
  'https://api.rensei.ai/api/factory/events?provider=openai&type=error&after=2026-05-03'

# Group by error type
# → If "rate limit exceeded" → Upgrade API quota
# → If "authentication failed" → Rotate API key
# → If "timeout" → Network issue or provider overload
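
The "group by error type" step can be sketched as below. The response format of the events endpoint is not documented here, so this example assumes you have already extracted the error messages as plain strings; `ErrorCategory`, `categorize`, and the sample messages are illustrative.

```typescript
type ErrorCategory = "rate_limit" | "auth" | "timeout" | "other";

// Map a free-text error message to a category using the phrases quoted above.
function categorize(message: string): ErrorCategory {
  const m = message.toLowerCase();
  if (m.indexOf("rate limit") !== -1) return "rate_limit";
  if (m.indexOf("authentication failed") !== -1) return "auth";
  if (m.indexOf("timeout") !== -1) return "timeout";
  return "other";
}

// Suggested next step per category, mirroring the guidance above.
const nextStep: Record<ErrorCategory, string> = {
  rate_limit: "Upgrade API quota",
  auth: "Rotate API key",
  timeout: "Check network path and provider load",
  other: "Inspect the raw event",
};

function countByCategory(messages: string[]): Record<ErrorCategory, number> {
  const counts: Record<ErrorCategory, number> = { rate_limit: 0, auth: 0, timeout: 0, other: 0 };
  for (const msg of messages) counts[categorize(msg)] += 1;
  return counts;
}

const counts = countByCategory([
  "Rate limit exceeded",
  "rate limit exceeded",
  "Authentication failed",
  "Request timeout after 30s",
]);
```

The category with the highest count indicates which entry in `nextStep` to act on first.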

Cost Differs Significantly (e.g., Haiku $0.02, GPT-4 $0.20)

  • ✓ Expected; different models have different pricing
  • ⚠ But if cost is 10x and success is similar, you're overpaying

Action:

  • Swap expensive model for cheaper one (e.g., GPT-4 Turbo → GPT-4o)
  • Use smaller model for routine tasks (e.g., Haiku for classification)
  • See Model Routing for cost-aware routing config

Latency Differs Significantly (e.g., Anthropic 1s, OpenAI 4s)

  • ⚠ OpenAI may be overloaded or geographically distant

Action:

  • Check OpenAI status page for incidents
  • Consider geographic routing (route to closest provider)
  • See Model Routing for latency-aware routing

Cost-Efficiency Score Reveals a Winner

  • ✓ High score → Optimal provider for this work-type
  • Recommendation: Use as default; reserve other providers for fallback

Example: Haiku efficiency 85, GPT-4o efficiency 60, GPT-4 efficiency 25 → Default to Haiku; fail over to GPT-4o; reserve GPT-4 for tasks that need complex reasoning

Upward Cost Trend (Week-over-week)

  • ⚠ Likely cause: Migration to more expensive model

Verification:

  • Check provider mix in "Total Runs" chart; did % allocation shift?
  • Check token counts; did context window grow?
  • Check model selection in workflow definition (did you upgrade?)

Action: Decide if upgrade is intentional (higher quality) or accidental (revert).

Optimization Playbook

Goal: Reduce Cost

  1. Identify most-used expensive provider (e.g., GPT-4)
  2. Test swapping to cheaper model (e.g., Haiku, Sonnet)
  3. Check that the success rate doesn't drop by more than 2 percentage points
  4. If success ≥ threshold → Deploy swap in your model-routing config
  5. Measure savings (usually 50-80% cost reduction)
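
Steps 3 to 5 amount to a simple acceptance check, sketched below. It assumes the 2% threshold is measured in percentage points; `ModelResult`, `evaluateSwap`, and the sample numbers are hypothetical.

```typescript
interface ModelResult {
  successRatePct: number; // 0-100
  avgCostUsd: number;
}

interface SwapDecision {
  successDropPts: number; // percentage points lost by the candidate
  savingsPct: number;     // cost reduction relative to the current model
  deploy: boolean;
}

// Accept the swap when success drops by no more than maxDropPts percentage points.
function evaluateSwap(current: ModelResult, candidate: ModelResult, maxDropPts = 2): SwapDecision {
  const successDropPts = current.successRatePct - candidate.successRatePct;
  const savingsPct = ((current.avgCostUsd - candidate.avgCostUsd) / current.avgCostUsd) * 100;
  return { successDropPts, savingsPct, deploy: successDropPts <= maxDropPts };
}

const decision = evaluateSwap(
  { successRatePct: 99, avgCostUsd: 0.2 },  // current, expensive model
  { successRatePct: 98, avgCostUsd: 0.05 }, // cheaper candidate under test
);
```

Here the candidate loses one point of success and saves roughly 75% of the cost, so the swap passes. A candidate that dropped to 95% would be rejected regardless of savings.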

Goal: Improve Speed

  1. Identify slowest provider
  2. Check whether it is slow per request (latency) or slow because it is heavily used (queued)
  3. If network latency → Use geographic routing
  4. If queue → Upgrade provider's API tier or use failover
  5. If model reasoning (inherently slow) → Accept the latency

Goal: Improve Reliability

  1. Find provider with <98% success rate
  2. Identify error type (rate limit, auth, timeout)
  3. Fix root cause (upgrade quota, rotate key, improve network)
  4. Alternatively, implement failover so errors don't block (see Routing Intelligence)

Goal: Multi-Provider Resilience

  1. Keep primary provider with high cost-efficiency score
  2. Add secondary provider with good success rate (even if slower/more expensive)
  3. Configure failover in Model Routing
  4. Route a small, steady share of traffic (e.g., 10%) to the secondary to catch degradation early
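
The primary/secondary split with failover can be sketched as follows. This is not the Model Routing configuration format, which is documented separately; `RoutingPlan`, `pickProvider`, and `runWithFailover` are hypothetical helpers, and the calls are synchronous only to keep the example short.

```typescript
interface RoutingPlan {
  primary: string;
  secondary: string;
  secondaryShare: number; // fraction of traffic sent to the secondary, e.g. 0.1
}

// Pick a provider for one task. `roll` is a number in [0, 1), injected so the
// choice is testable; in practice it would come from Math.random().
function pickProvider(plan: RoutingPlan, roll: number): string {
  return roll < plan.secondaryShare ? plan.secondary : plan.primary;
}

// Try the chosen provider first and fall back to the other one on failure.
// Synchronous for brevity; real provider calls would be asynchronous.
function runWithFailover<T>(plan: RoutingPlan, roll: number, call: (provider: string) => T): T {
  const first = pickProvider(plan, roll);
  const second = first === plan.primary ? plan.secondary : plan.primary;
  try {
    return call(first);
  } catch (err) {
    return call(second);
  }
}

const plan: RoutingPlan = { primary: "anthropic", secondary: "openai", secondaryShare: 0.1 };
const attempts: string[] = [];
const result = runWithFailover(plan, 0.5, (provider) => {
  attempts.push(provider);
  if (provider === "anthropic") throw new Error("rate limit exceeded"); // simulated outage
  return "answered by " + provider;
});
```

Keeping a small share of live traffic on the secondary means its benchmark row stays populated, so degradation shows up before you need to fail over.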

API & Programmatic Access

Component shape (TypeScript):

interface ProviderBenchmark {
  provider: string           // "anthropic", "openai", "gemini", etc.
  successRate: number        // percent, 0-100 (the score formula uses the 0-1 fraction)
  avgCostUsd: number
  avgDurationMs: number
  totalRuns: number
  costEfficiencyScore: number  // 0-100
  trend7d: {
    costDelta: number        // % change from 7 days ago
    successDelta: number
    durationDelta: number
  }
}

The metrics are aggregated server-side. There is no dedicated Provider Benchmarks endpoint; the data is served by the same metrics query that backs the performance dashboard.

To fetch raw provider metrics:

curl -H "Authorization: Bearer $RENSEI_API_KEY" \
  'https://api.rensei.ai/api/factory/metrics?metric=provider&workspaceId=ws_123&after=2026-05-03T00:00:00Z&before=2026-06-03T00:00:00Z'

Returns an array of provider benchmark objects.
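
For programmatic use, the same query string can be assembled in TypeScript. This sketch only builds the URL from the parameters shown in the curl example; `MetricsQuery` and `providerMetricsUrl` are hypothetical names, and the request itself (with the `Authorization: Bearer` header) is left to whichever HTTP client you use.

```typescript
interface MetricsQuery {
  workspaceId: string;
  after: string;  // ISO-8601 timestamp, start of the window
  before: string; // ISO-8601 timestamp, end of the window
}

// Build the metrics URL used in the curl example above.
function providerMetricsUrl(q: MetricsQuery): string {
  const params: [string, string][] = [
    ["metric", "provider"],
    ["workspaceId", q.workspaceId],
    ["after", q.after],
    ["before", q.before],
  ];
  // Percent-encode each value so timestamps and IDs are safe in a query string.
  const query = params.map(([k, v]) => k + "=" + encodeURIComponent(v)).join("&");
  return "https://api.rensei.ai/api/factory/metrics?" + query;
}

const url = providerMetricsUrl({
  workspaceId: "ws_123",
  after: "2026-05-03T00:00:00Z",
  before: "2026-06-03T00:00:00Z",
});
```

The colons in the timestamps come out as `%3A`, which is equivalent to the unencoded form in the curl example.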
