Routing Intelligence
Thompson Sampling posteriors and survivalReward.
Watch the platform learn which providers work best for your workloads. Routing Intelligence shows Thompson Sampling posteriors that drive real-time provider selection, plus code-survival rewards that close the learning loop from code quality back to model routing.
Overview
The platform uses multi-armed bandit (MAB) selection via Thompson Sampling to route tasks to providers. Instead of static provider assignment, each workflow learns dynamically which providers succeed, cost less, and run faster - and routes accordingly.
The Routing Intelligence panel reveals:
- Exploration Rate - What fraction of decisions are deliberate exploration steps (not the current best arm)
- Avg Confidence - Mean confidence score across all active arms (0-100%; color-coded green/amber/red)
- Provider Posteriors table - Per `(provider × work-type)` arm with `expectedReward`, `confidence`, `totalObservations`, and the optional `survivalReward` column
- Hot-path weighting indicator - A flame badge appears when at least one arm carries a survival-derived reward (`survivalRewardCount > 0`)
- Posterior distribution chart - Overlaid Beta curves rendered by `PosteriorDistribution`
- Recent Decisions - Latest routing calls labeled `Exploration` or `Exploitation`
Key Concepts
Thompson Sampling
A Bayesian approach to the exploration-exploitation trade-off.
How it works:
- Observe outcomes (success/failure, cost, duration) for each provider
- Fit a Beta(α, β) distribution to each provider arm (α = successes + 1, β = failures + 1)
- Sample once from each distribution at decision time
- Route to the provider with the highest sample value
- Repeat; distributions narrow as observations accumulate
Advantage: Automatically balances trying new providers (exploration) with using known-good ones (exploitation).
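The steps above can be sketched as follows. This is a minimal illustration, not the platform's `mab-selector.ts`: the `Arm` shape, the function names, and the seedable `mulberry32` generator are assumptions made for the example. The Beta draw is built from two Gamma draws using the Marsaglia-Tsang method.

```typescript
// Hypothetical arm shape; alpha/beta follow the definitions above.
interface Arm {
  provider: string;
  alpha: number; // successes + 1
  beta: number; // failures + 1
}

type Rng = () => number; // uniform on [0, 1)

// Seedable PRNG (mulberry32) so that runs are reproducible.
function mulberry32(seed: number): Rng {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function standardNormal(rng: Rng): number {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rng());
}

// Gamma(shape, 1) draw using the Marsaglia-Tsang method.
function sampleGamma(shape: number, rng: Rng): number {
  if (shape < 1) {
    // Boost small shapes: Gamma(k) = Gamma(k + 1) * U^(1/k)
    return sampleGamma(shape + 1, rng) * Math.pow(1 - rng(), 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = standardNormal(rng);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue; // reject and redraw
    const u = 1 - rng();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(alpha, beta) draw as a ratio of two Gamma draws.
function sampleBeta(alpha: number, beta: number, rng: Rng): number {
  const x = sampleGamma(alpha, rng);
  const y = sampleGamma(beta, rng);
  return x / (x + y);
}

// Sample once per arm and route to the arm with the highest sample.
function selectArm(arms: Arm[], rng: Rng): Arm {
  let best: Arm | undefined;
  let bestSample = -Infinity;
  for (const arm of arms) {
    const sample = sampleBeta(arm.alpha, arm.beta, rng);
    if (sample > bestSample) {
      bestSample = sample;
      best = arm;
    }
  }
  if (best === undefined) throw new Error("selectArm requires at least one arm");
  return best;
}

// Fold one observed outcome back into the arm's posterior.
function recordOutcome(arm: Arm, success: boolean): Arm {
  return success
    ? { ...arm, alpha: arm.alpha + 1 }
    : { ...arm, beta: arm.beta + 1 };
}
```

Calling `selectArm(arms, mulberry32(1))` repeatedly and feeding each result through `recordOutcome` reproduces the loop: arms with more successes are sampled high more often, while wide posteriors still win occasionally, which is where exploration comes from.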
Provider Posteriors
Each row in the panel represents one (provider × work-type) arm. The fields come directly from RoutingMetrics.posteriors[]:
| Field | Type | What it means |
|---|---|---|
| `provider` | string | Provider name (e.g. `anthropic`, `openai`) |
| `workType` | string | SDLC work type (e.g. `research`, `dev`, `qa`, `acceptance`) |
| `alpha` | number | Beta distribution α parameter (successes + 1) |
| `beta` | number | Beta distribution β parameter (failures + 1) |
| `expectedReward` | number | Mean of the Beta distribution: α / (α + β) |
| `confidence` | number | 0-1; how concentrated the posterior is |
| `totalObservations` | int | α + β − 2; drives the learning-signal tier |
| `avgCostUsd` | number? | Optional per-arm average cost |
| `survivalReward` | number? | Latest code-survival reward applied (0-1); undefined if none yet |
Learning-signal tiers are derived from totalObservations:
| Observations | Signal | Meaning |
|---|---|---|
| 0 | No data | Arm has no observations yet |
| 1 | At prior | Effectively at the Beta(1,1) uniform prior |
| 2-9 | Learning | Posterior is shifting but still wide |
| ≥ 10 | Converging | Posterior has meaningfully concentrated |
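The derived fields and the tier mapping can be written out directly from the two tables above. The function names here are hypothetical, and the handling of non-integer observation counts (anything between 0 and 2 is treated as `At prior`) is an assumption of this sketch; the `confidence` formula is not specified on this page, so it is left out.

```typescript
type LearningSignal = "No data" | "At prior" | "Learning" | "Converging";

// Mean of Beta(alpha, beta).
function expectedReward(alpha: number, beta: number): number {
  return alpha / (alpha + beta);
}

// Observations recorded on top of the Beta(1, 1) prior.
function totalObservations(alpha: number, beta: number): number {
  return alpha + beta - 2;
}

// Tier boundaries follow the table: 0, 1, 2-9, and 10 or more.
function learningSignal(observations: number): LearningSignal {
  if (observations <= 0) return "No data";
  if (observations < 2) return "At prior"; // assumption: fractional counts below 2 land here
  if (observations < 10) return "Learning";
  return "Converging";
}
```

For example, an arm with α = 104 and β = 3 has 105 observations, an expected reward of about 0.972, and the `Converging` signal.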
Survival Reward and Hot-Path Weighting
Code quality feeds back into routing via survival-reward-feeder.ts. When a PR authored by an arm's provider survives 30 days without revert or significant rework, the feeder writes a reward record to both agent_arm_stats and the Donmai Redis store (store='donmai_redis').
- `survivalReward` field on each posterior: the latest reward from `survival_routing_rewards` joined by `${provider}:${workType}` arm key. Values ∈ [0, 1].
- `survivalRewardCount` in `summary`: count of arms whose latest reward is survival-derived in this response. When non-zero, the panel header shows a flame badge - "Hot-path weighting active."
- Effect: Arms accumulate positive signal from code durability even in periods with few new sessions, closing the quality → routing feedback loop.
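A small sketch of how the arm key and the badge condition fit together. The `PosteriorLike` type and the helper names are hypothetical, and the sketch assumes that an arm counts as survival-derived exactly when its `survivalReward` is defined; the real API may compute `survivalRewardCount` differently.

```typescript
// Hypothetical subset of a posterior row, for illustration only.
interface PosteriorLike {
  provider: string;
  workType: string;
  survivalReward?: number; // 0-1 when present
}

// Arm key in the `${provider}:${workType}` form used by the ledger join.
function armKey(p: PosteriorLike): string {
  return `${p.provider}:${p.workType}`;
}

// Assumption: an arm is survival-derived when survivalReward is defined.
function countSurvivalArms(posteriors: PosteriorLike[]): number {
  return posteriors.filter((p) => p.survivalReward !== undefined).length;
}

// The flame badge shows when the count is non-zero.
function hotPathWeightingActive(survivalRewardCount?: number): boolean {
  return (survivalRewardCount ?? 0) > 0;
}
```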
Headline Metrics
The two cards at the top of the panel come from RoutingMetrics.summary:
Exploration Rate
What fraction of recent routing decisions were deliberate exploration steps (not the current highest-reward arm)?
- Source: `summary.explorationRate` (0-1, displayed as %)
- Default target: 5-10%
- Motivation: Even when one provider clearly leads, the system still routes a small fraction to other providers to detect if they improve
- High (>15%): Learning is fast but slightly inefficient - tolerable during ramp-up
- Low (<3%): Exploration is nearly off; good for stability, but you may miss improvements in underused arms
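As a rough illustration of the thresholds above, the rate can be thought of as the share of recent decisions labeled `Exploration`. The `DecisionLike` shape, its `kind` field, and the `near-target` band name (for rates between the stated ranges) are assumptions of this sketch, not part of the API.

```typescript
// Hypothetical decision shape; only the label matters here.
interface DecisionLike {
  kind: "Exploration" | "Exploitation";
}

type ExplorationBand = "low" | "target" | "high" | "near-target";

// Fraction of decisions that were exploration steps (0 when there are none).
function explorationRate(decisions: DecisionLike[]): number {
  if (decisions.length === 0) return 0;
  const explored = decisions.filter((d) => d.kind === "Exploration").length;
  return explored / decisions.length;
}

// Bands from this section: <3% low, 5-10% default target, >15% high.
function explorationBand(rate: number): ExplorationBand {
  if (rate < 0.03) return "low";
  if (rate > 0.15) return "high";
  if (rate >= 0.05 && rate <= 0.1) return "target";
  return "near-target"; // between the documented ranges
}
```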
Average Confidence
Mean confidence score across all active arms.
- Source: `summary.avgConfidence` (0-1, color-coded: green ≥ 0.8, amber 0.5-0.8, red <0.5)
- Low confidence: Arms have few observations; posteriors are still wide. Run more sessions to narrow them.
- High confidence (green): Posteriors are well-defined; routing decisions are reliable.
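The color thresholds translate into a short helper. The function names are hypothetical; the averaging shown is a plain arithmetic mean, which matches the "mean confidence score" description but is not confirmed as the exact server-side computation.

```typescript
type ConfidenceColor = "green" | "amber" | "red";

// Plain mean of per-arm confidence values (0 when there are no arms).
function averageConfidence(confidences: number[]): number {
  if (confidences.length === 0) return 0;
  const total = confidences.reduce((acc, c) => acc + c, 0);
  return total / confidences.length;
}

// green >= 0.8, amber 0.5-0.8, red < 0.5.
function confidenceColor(confidence: number): ConfidenceColor {
  if (confidence >= 0.8) return "green";
  if (confidence >= 0.5) return "amber";
  return "red";
}
```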
Provider Posteriors Table
The sortable table below the headline cards shows one row per (provider × work-type) arm. Example:
| Provider | Work Type | Exp. Reward | Confidence | Observations | Survival | Signal |
|---|---|---|---|---|---|---|
| anthropic | dev | 0.973 | 94.2% | 105 | 0.044 | Converging |
| openai | dev | 0.922 | 87.5% | 98 | -- | Converging |
| anthropic | research | 0.991 | 98.1% | 312 | 0.031 | Converging |
| anthropic | qa | 0.971 | 90.4% | 54 | -- | Converging |
| local-debug | dev | 0.750 | 60.0% | 15 | -- | Converging |

Column guide:
- Exp. Reward (`expectedReward`): Beta distribution mean = α / (α + β). The single best estimate of this arm's success probability.
- Confidence: How concentrated the posterior is. Narrow = reliable; wide = still learning.
- Observations (`totalObservations`): α + β − 2. Drives the learning signal label (see below).
- Survival: `survivalReward` - the latest code-survival-derived reward applied to this arm. Blank when no survival scan has fed this arm yet.
- Signal: Tier label derived from `totalObservations`. `No data` → `At prior` → `Learning` → `Converging`.
You can sort by any column; default sort is expectedReward descending.
Posterior Distribution Chart
Visualization of the Beta(α, β) curves for each provider, rendered by PosteriorDistribution. Each curve peaks near the provider's expectedReward: the peak sits at the mode, (α − 1) / (α + β − 2), which approaches the mean α / (α + β) as observations accumulate. Narrow curves (converging arms) overlap less; wide curves (learning arms) overlap a lot, meaning either provider might win any given sample.
Probability Density
│
│ ╱╲
│ ╱ ╲ anthropic/dev (expectedReward=0.97)
│ ╱ ╲
│ ╱ ╲ openai/dev (expectedReward=0.92)
│ ╱ ╲
│ ╱╲ ╲
│ ╱ ╲ ╲
│ ╱ ╲╲ ╲
│╱──────╲╲──────╲
└────────────────── Probability of Success (0-1)

The overlap region is where exploration happens: both providers have a non-zero chance of being sampled even if one clearly leads.
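To make the curves concrete, here is a sketch of how a Beta(α, β) density could be evaluated on a grid for plotting. It is not the `PosteriorDistribution` component: the function names are hypothetical, log-gamma uses a Lanczos approximation, and the density is evaluated only on the open interval (0, 1).

```typescript
// Lanczos approximation coefficients (g = 7) for log-gamma.
const LANCZOS_C0 = 0.99999999999980993;
const LANCZOS_TAIL = [
  676.5203681218851, -1259.1392167224028, 771.32342877765313,
  -176.61502916214059, 12.507343278686905, -0.13857109526572012,
  9.9843695780195716e-6, 1.5056327351493116e-7,
];

function logGamma(x: number): number {
  if (x < 0.5) {
    // Reflection formula for small arguments.
    return Math.log(Math.PI / Math.abs(Math.sin(Math.PI * x))) - logGamma(1 - x);
  }
  const z = x - 1;
  let sum = LANCZOS_C0;
  LANCZOS_TAIL.forEach((c, i) => {
    sum += c / (z + i + 1);
  });
  const t = z + 7.5;
  return 0.5 * Math.log(2 * Math.PI) + (z + 0.5) * Math.log(t) - t + Math.log(sum);
}

// Beta(alpha, beta) density, computed in log space for numerical stability.
function betaPdf(x: number, alpha: number, beta: number): number {
  if (x <= 0 || x >= 1) return 0; // sketch only covers the open interval
  const logB = logGamma(alpha) + logGamma(beta) - logGamma(alpha + beta);
  return Math.exp((alpha - 1) * Math.log(x) + (beta - 1) * Math.log(1 - x) - logB);
}

interface CurvePoint {
  x: number;
  y: number;
}

// Evaluate the density at the midpoints of `points` equal-width bins.
function densityCurve(alpha: number, beta: number, points = 200): CurvePoint[] {
  const curve: CurvePoint[] = [];
  for (let i = 0; i < points; i++) {
    const x = (i + 0.5) / points;
    curve.push({ x, y: betaPdf(x, alpha, beta) });
  }
  return curve;
}
```

Plotting `densityCurve(104, 3)` next to `densityCurve(91, 9)` gives two narrow peaks with a small overlap, while `densityCurve(2, 2)` is wide and overlaps almost everything.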
Recent Decisions Table
The last N routing calls, each labeled Exploration (purple) or Exploitation (blue). Exploration steps carry an explorationReason string from the routing engine (visible as a tooltip); exploitation steps took the current highest-expected-reward arm.
Data Flow
flowchart TD
A[Session completes] --> B[arm-attribution.ts]
B --> C[agent_arm_stats\nsuccess_count / failure_count]
D[PR merged → 30d checkpoint] --> E[code-survival-job.ts]
E --> F[survival-reward-feeder.ts]
F --> C
F --> G[Donmai Redis store\nstore=donmai_redis]
C --> H[mab-selector.ts\nsamples Beta posteriors]
H --> I[load-aware-router.ts\napplies load / cost penalties]
I --> J[Routing decision\nExploration or Exploitation]
J --> K[RoutingIntelligencePanel\nexpectedReward, survivalReward,\nlearningSignal]

Provider outcomes:
- `agent_arm_stats` - One row per `(provider × work-type)`, updated by `arm-attribution.ts` after each session completes
- Updated in real time (when the session reaches a terminal state)

Survival rewards:
- `survival-reward-feeder.ts` - At each 30-day code-survival checkpoint, applies reward to `agent_arm_stats` AND the Donmai Redis store
- `survival_routing_rewards` table - Ledger row written per arm per checkpoint (keyed by `${provider}:${workType}`)
- `survivalReward` on `ProviderPosterior` - Joined from this ledger when the routing-metrics API responds

Thompson Sampling runtime:
- `mab-selector.ts` - Samples from Beta(α, β) posteriors at request time using Marsaglia-Tsang Gamma sampler
- `load-aware-router.ts` - Applies degraded/unknown/load penalties to sampled values before final selection
Display & Filtering
The Performance → Routing Intelligence panel shows:
- Time window - 7d, 30d, 90d (shorter = more recent, more volatile data)
- Work-type filter - Narrow to `research`, `dev`, `qa`, `acceptance`, etc.
- Project filter - Project-scoped posteriors
Panel layout:
- Headline cards - Exploration rate + Avg confidence (with hot-path weighting flame badge when `survivalRewardCount > 0`)
- Provider Posteriors table - Sortable by provider, workType, expectedReward, confidence, totalObservations; columns include survivalReward and learning-signal tier
- Posterior distribution chart - Overlaid Beta curves from `PosteriorDistribution` component
- Recent Decisions table - Exploration vs. Exploitation labels per call
Interpreting Results
One Provider Dominates (e.g., expectedReward = 0.97 vs. 0.92)
- System has learned a clear winner; confidence is high (signal: Converging)
- Routing will favor the lead provider, but still explore the other ~5% of the time
Action: Normal; system is optimized. Monitor for improvements in the trailing provider.
Arms in "Learning" or "At Prior" Tier
- Signal badge shows `Learning` (2-9 observations) or `At prior` (1 observation)
- Posteriors are still wide; routing decisions are nearly random for that arm
Action: Let it run. An arm needs at least 10 observations to reach Converging. New providers and new work-types always start here.
survivalReward Column Populated
- Hot-path weighting flame badge is visible in the panel header
- Survival rewards are feeding routing decisions; `expectedReward` reflects code durability, not just session success
Verification:
# Fetch routing metrics and check survivalReward on posteriors
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
'https://api.rensei.ai/api/public/routing-metrics?workspaceId=ws_123' \
| jq '.posteriors[] | {provider, workType, expectedReward, survivalReward}'

Exploration Rate Too High or Too Low
- High (>15%) → Learning is fast but inefficient; consider reducing if system is stable
- Low (<3%) → Missing improvements in underused providers; raise exploration slightly
Configuration: Exploration rate is set per organization in Model Routing settings.
Advanced: Load-Aware Routing
On its own, Thompson Sampling looks only at observed outcomes; it ignores provider capacity. load-aware-router.ts adds:
- Load penalty - If a provider's current queue is long (from `a2a_task_instances` via `DbLoadProvider`), reduce its sampled value
- Degraded penalty - If the provider is in degraded mode (rate limited, health-failing), apply an additional penalty
- Cost penalty - If a provider is expensive relative to peers, reduce its sample by the cost ratio
The final routing decision: sample_value - load_penalty - degraded_penalty - cost_penalty.
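The adjustment can be sketched as a subtraction followed by an argmax. How `load-aware-router.ts` actually derives each penalty is not described here, so the penalties are plain inputs in this sketch; the `SampledArm` shape and the function names are hypothetical.

```typescript
// Hypothetical inputs: a Thompson sample plus already-computed penalties.
interface SampledArm {
  provider: string;
  sampleValue: number; // draw from the arm's Beta posterior
  loadPenalty: number;
  degradedPenalty: number;
  costPenalty: number;
}

// sample_value - load_penalty - degraded_penalty - cost_penalty
function adjustedScore(arm: SampledArm): number {
  return arm.sampleValue - arm.loadPenalty - arm.degradedPenalty - arm.costPenalty;
}

// Route to the arm with the highest adjusted score.
function pickProvider(arms: SampledArm[]): string {
  let best: SampledArm | undefined;
  for (const arm of arms) {
    if (best === undefined || adjustedScore(arm) > adjustedScore(best)) best = arm;
  }
  if (best === undefined) throw new Error("pickProvider requires at least one arm");
  return best.provider;
}
```

With these inputs, an arm that sampled 0.95 but carries a 0.2 load penalty (adjusted 0.75) loses to an unpenalized arm that sampled 0.90.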
Advanced: Skill Matcher
When multiple providers can handle a task, skill-matcher.ts pre-filters to capable providers before Thompson Sampling runs:
- Task requires GitHub credentials → only route to providers that have GitHub OAuth wired
- Task requires GPU execution → only route to providers with GPU pools enabled
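A capability pre-filter of this kind might look like the sketch below. The capability strings (`github-oauth`, `gpu`), the `ProviderCapabilities` shape, and the function name are illustrative assumptions, not the actual `skill-matcher.ts` interface.

```typescript
// Hypothetical capability registry entry.
interface ProviderCapabilities {
  provider: string;
  capabilities: ReadonlySet<string>;
}

// Keep only providers that have every capability the task requires.
function filterCapable(
  providers: ProviderCapabilities[],
  required: string[],
): ProviderCapabilities[] {
  return providers.filter((p) => required.every((cap) => p.capabilities.has(cap)));
}
```

Only the providers returned by the filter would then be passed on as candidate arms for sampling.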
API
Endpoint: GET /api/public/routing-metrics?workspaceId=<id>&workType=<type>&limit=<n>
Returns the full RoutingMetrics shape:
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
'https://api.rensei.ai/api/public/routing-metrics?workspaceId=ws_123'

Response shape (TypeScript types from routing-metrics.ts):

interface RoutingMetrics {
  posteriors: ProviderPosterior[] // one per (provider × work-type) arm
  recentDecisions: RoutingDecision[]
  summary: {
    totalObservations: number
    routingEnabled: boolean
    explorationRate: number // 0-1
    avgConfidence: number // 0-1
    survivalRewardCount?: number // drives hot-path weighting badge
  }
  timestamp?: string
}

See Provider Benchmarks for raw cost/success/duration data without the Bayesian framing.
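For callers working in TypeScript, a thin client around this endpoint could look like the sketch below. The helper names, the `Fetcher` type, and the injected-fetcher pattern are assumptions of the example; `RoutingDecision` is left opaque because its fields are not listed on this page, and the `ProviderPosterior` fields are taken from the Provider Posteriors table.

```typescript
interface ProviderPosterior {
  provider: string;
  workType: string;
  alpha: number;
  beta: number;
  expectedReward: number;
  confidence: number;
  totalObservations: number;
  avgCostUsd?: number;
  survivalReward?: number;
}

type RoutingDecision = Record<string, unknown>; // fields not documented here

interface RoutingMetrics {
  posteriors: ProviderPosterior[];
  recentDecisions: RoutingDecision[];
  summary: {
    totalObservations: number;
    routingEnabled: boolean;
    explorationRate: number;
    avgConfidence: number;
    survivalRewardCount?: number;
  };
  timestamp?: string;
}

interface RoutingMetricsQuery {
  workspaceId: string;
  workType?: string;
  limit?: number;
}

// Minimal fetch-like signature so the caller can pass the global fetch or a stub.
type Fetcher = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the request URL from the documented query parameters.
function buildRoutingMetricsUrl(
  query: RoutingMetricsQuery,
  baseUrl = "https://api.rensei.ai",
): string {
  const params = [`workspaceId=${encodeURIComponent(query.workspaceId)}`];
  if (query.workType !== undefined) params.push(`workType=${encodeURIComponent(query.workType)}`);
  if (query.limit !== undefined) params.push(`limit=${query.limit}`);
  return `${baseUrl}/api/public/routing-metrics?${params.join("&")}`;
}

// Fetch and return the metrics; the response is trusted, not validated.
async function fetchRoutingMetrics(
  query: RoutingMetricsQuery,
  apiKey: string,
  fetcher: Fetcher,
): Promise<RoutingMetrics> {
  const res = await fetcher(buildRoutingMetricsUrl(query), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`routing-metrics request failed with status ${res.status}`);
  return (await res.json()) as RoutingMetrics;
}
```

In a runtime with a global `fetch`, passing it as the `fetcher` argument should work; a production client would likely validate the response shape instead of casting it.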
Related Pages
- Provider Benchmarks - Raw success/cost/duration metrics
- Model Routing - Configure providers, pools, and routing
- Code Survival - Foundation for survival rewards
- A2A Routing - Routing for agent-to-agent dispatch (similar MAB logic)