Routing Intelligence

Routing bandit re-ranking is preview/beta and off by default per tenant.

Watch the platform learn which providers work best for your workloads. Routing Intelligence shows Thompson Sampling posteriors that drive real-time provider selection, plus code-survival rewards that close the learning loop from code quality back to model routing.

Overview

The platform uses multi-armed bandit (MAB) selection via Thompson Sampling to route tasks to providers. Instead of static provider assignment, each workflow learns dynamically which providers succeed, cost less, and run faster - and routes accordingly.

The Routing Intelligence panel reveals:

Exploration Rate - What fraction of decisions are deliberate exploration steps (not the current best arm)
Avg Confidence - Mean confidence score across all active arms (0-100%; color-coded green/amber/red)
Provider Posteriors table - Per (provider × work-type) arm with expectedReward, confidence, totalObservations, and the optional survivalReward column
Hot-path weighting indicator - A flame badge appears when at least one arm carries a survival-derived reward (survivalRewardCount > 0)
Posterior distribution chart - Overlaid Beta curves rendered by PosteriorDistribution
Recent Decisions - Latest routing calls labeled Exploration or Exploitation

Key Concepts

Thompson Sampling

A Bayesian approach to the exploration-exploitation trade-off.

How it works:

Observe outcomes (success/failure, cost, duration) for each provider
Fit a Beta(α, β) distribution to each provider arm (α = successes + 1, β = failures + 1)
Sample once from each distribution at decision time
Route to the provider with the highest sample value
Repeat; distributions narrow as observations accumulate

Advantage: Automatically balances trying new providers (exploration) with using known-good ones (exploitation).
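
The steps above can be sketched in a few lines. This is a minimal illustration, not the platform's selector: the `Arm` shape and function names are hypothetical, and the Gamma draw uses a sum of exponentials, which is only valid for the integer α and β that come from plain success/failure counts.

```typescript
// Hypothetical arm shape for illustration; not the platform's actual type.
interface Arm {
  provider: string;
  successes: number; // integer count of successful sessions
  failures: number; // integer count of failed sessions
}

// Gamma(k, 1) for a positive integer k, drawn as a sum of k exponentials.
function sampleGammaInt(k: number, rand: () => number): number {
  let sum = 0;
  for (let i = 0; i < k; i++) {
    sum -= Math.log(1 - rand()); // 1 - rand() lies in (0, 1], so the log is finite
  }
  return sum;
}

// Beta(alpha, beta) as X / (X + Y) with X ~ Gamma(alpha) and Y ~ Gamma(beta).
function sampleBeta(alpha: number, beta: number, rand: () => number): number {
  const x = sampleGammaInt(alpha, rand);
  const y = sampleGammaInt(beta, rand);
  return x / (x + y);
}

// One routing decision: sample every arm once and take the highest draw.
function selectArm(arms: Arm[], rand: () => number = Math.random): Arm {
  let best: Arm | undefined;
  let bestSample = -Infinity;
  for (const arm of arms) {
    const draw = sampleBeta(arm.successes + 1, arm.failures + 1, rand);
    if (draw > bestSample) {
      bestSample = draw;
      best = arm;
    }
  }
  if (best === undefined) throw new Error("no arms to choose from");
  return best;
}
```

An arm with few observations has a wide distribution, so it occasionally produces the highest draw even when its mean is lower. That is where exploration comes from without a separate exploration rule.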

Provider Posteriors

Each row in the panel represents one (provider × work-type) arm. The fields come directly from RoutingMetrics.posteriors[]:

Field	Type	What it means
`provider`	string	Provider name (e.g. `anthropic`, `openai`)
`workType`	string	SDLC work type (e.g. `research`, `dev`, `qa`, `acceptance`)
`alpha`	number	Beta distribution α parameter (successes + 1)
`beta`	number	Beta distribution β parameter (failures + 1)
`expectedReward`	number	Mean of the Beta distribution: α / (α + β)
`confidence`	number	0-1; how concentrated the posterior is
`totalObservations`	int	α + β − 2; drives the learning-signal tier
`avgCostUsd`	number?	Optional per-arm average cost
`survivalReward`	number?	Latest code-survival reward applied (0-1); `undefined` if none yet

Learning-signal tiers are derived from totalObservations:

Observations	Signal	Meaning
0	No data	Arm has no observations yet
1	At prior	Effectively at the Beta(1,1) uniform prior
2-9	Learning	Posterior is shifting but still wide
≥ 10	Converging	Posterior has meaningfully concentrated
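
The derived fields and the tier label follow directly from α and β. The sketch below restates the formulas from the two tables; the function names are illustrative and are not taken from the platform's code.

```typescript
type LearningSignal = "No data" | "At prior" | "Learning" | "Converging";

// totalObservations = alpha + beta - 2 (the two prior pseudo-counts are removed).
function totalObservations(alpha: number, beta: number): number {
  return alpha + beta - 2;
}

// expectedReward = mean of Beta(alpha, beta).
function expectedReward(alpha: number, beta: number): number {
  return alpha / (alpha + beta);
}

// Tier thresholds from the table: 0, 1, 2-9, and 10 or more observations.
function learningSignal(observations: number): LearningSignal {
  if (observations <= 0) return "No data";
  if (observations < 2) return "At prior";
  if (observations < 10) return "Learning";
  return "Converging";
}

// Example: 7 successes and 2 failures give alpha = 8, beta = 3.
const exampleObservations = totalObservations(8, 3); // 9
const exampleSignal = learningSignal(exampleObservations); // "Learning"
```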

Survival Reward and Hot-Path Weighting

Code quality feeds back into routing via survival-reward-feeder.ts. When a PR authored by an arm's provider survives 30 days without revert or significant rework, the feeder writes a reward record to both agent_arm_stats and the Donmai Redis store (store='donmai_redis').

survivalReward field on each posterior: the latest reward from survival_routing_rewards, joined on the ${provider}:${workType} arm key. Values are in [0, 1].
survivalRewardCount in summary: count of arms whose latest reward is survival-derived in this response. When non-zero, the panel header shows a flame badge - "Hot-path weighting active."
Effect: Arms accumulate positive signal from code durability even in periods with few new sessions, closing the quality → routing feedback loop.
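
On the client side, the badge condition can be approximated from the posteriors alone. The sketch below assumes an arm counts as survival-derived whenever its `survivalReward` is defined; the server may apply a stricter rule, and `PosteriorRow`, `armKey`, and `countSurvivalArms` are hypothetical names.

```typescript
// Trimmed, hypothetical view of a posterior row.
interface PosteriorRow {
  provider: string;
  workType: string;
  survivalReward?: number; // in [0, 1]; undefined if no reward yet
}

// Arm key in the ${provider}:${workType} form used by the rewards ledger.
function armKey(row: PosteriorRow): string {
  return `${row.provider}:${row.workType}`;
}

// Assumption: "survival-derived" is approximated as "survivalReward is set".
function countSurvivalArms(rows: PosteriorRow[]): number {
  return rows.filter((row) => row.survivalReward !== undefined).length;
}

const sampleRows: PosteriorRow[] = [
  { provider: "anthropic", workType: "dev", survivalReward: 0.044 },
  { provider: "openai", workType: "dev" },
  { provider: "anthropic", workType: "research", survivalReward: 0.031 },
];

const survivalArmCount = countSurvivalArms(sampleRows); // 2
const showFlameBadge = survivalArmCount > 0; // true
```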

Headline Metrics

The two cards at the top of the panel come from RoutingMetrics.summary:

Exploration Rate

What fraction of recent routing decisions were deliberate exploration steps (not the current highest-reward arm)?

Source: summary.explorationRate (0-1, displayed as %)
Default target: 5-10%
Motivation: Even when one provider clearly leads, the system still routes a small fraction to other providers to detect if they improve
High (>15%): Learning is fast but slightly inefficient - tolerable during ramp-up
Low (<3%): Exploration is nearly off; good for stability, but you may miss improvements in underused arms

Average Confidence

Mean confidence score across all active arms.

Source: summary.avgConfidence (0-1, color-coded: green ≥ 0.8, amber 0.5-0.8, red <0.5)
Low confidence: Arms have few observations; posteriors are still wide. Run more sessions to narrow them.
High confidence (green): Posteriors are well-defined; routing decisions are reliable.
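
The color thresholds map to a small lookup. This sketch only restates the bands listed above; `confidenceColor` and `formatPercent` are illustrative names, not panel internals.

```typescript
type ConfidenceColor = "green" | "amber" | "red";

// green at 0.8 and above, amber from 0.5 up to 0.8, red below 0.5.
function confidenceColor(avgConfidence: number): ConfidenceColor {
  if (avgConfidence >= 0.8) return "green";
  if (avgConfidence >= 0.5) return "amber";
  return "red";
}

// The summary value is 0-1; the card displays it as a percentage.
function formatPercent(value: number): string {
  return `${(value * 100).toFixed(1)}%`;
}
```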

Provider Posteriors Table

The sortable table below the headline cards shows one row per (provider × work-type) arm. Example:

Provider      Work Type   Exp. Reward  Confidence  Observations  Survival  Signal
anthropic     dev         0.973        94.2%       105           0.044     Converging
openai        dev         0.922        87.5%        98           --        Converging
anthropic     research    0.991        98.1%       312           0.031     Converging
anthropic     qa          0.971        90.4%        54           --        Converging
local-debug   dev         0.750        60.0%        15           --        Converging

Column guide:

Exp. Reward (expectedReward): Beta distribution mean = α / (α + β). The single best estimate of this arm's success probability.
Confidence: How concentrated the posterior is. Narrow = reliable; wide = still learning.
Observations (totalObservations): α + β − 2. Drives the learning signal label (see below).
Survival: survivalReward - the latest code-survival-derived reward applied to this arm. Empty (shown as -- in the example above) when no survival scan has fed this arm yet.
Signal: Tier label derived from totalObservations. No data → At prior → Learning → Converging.

You can sort by any column; default sort is expectedReward descending.

Posterior Distribution Chart

Visualization of the Beta(α, β) curves for each provider, rendered by PosteriorDistribution. Each curve peaks near the provider's expectedReward (strictly, at the Beta mode, which approaches the mean as observations accumulate). Narrow curves (converging arms) overlap less; wide curves (learning arms) overlap a lot, meaning either provider might win any given sample.

Probability Density
│
│        ╱╲
│       ╱  ╲ anthropic/dev (expectedReward=0.97)
│      ╱    ╲
│     ╱      ╲ openai/dev (expectedReward=0.92)
│    ╱        ╲
│   ╱╲        ╲
│  ╱  ╲        ╲
│ ╱    ╲╲      ╲
│╱──────╲╲──────╲
└────────────────── Probability of Success (0-1)

The overlap region is where exploration happens: both providers have a non-zero chance of being sampled even if one clearly leads.
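
How wide a curve is can be read off the standard Beta formulas. The sketch below computes the mean, mode, and standard deviation for a given (α, β); `betaStats` is an illustrative helper, not part of the PosteriorDistribution component.

```typescript
interface BetaStats {
  mean: number; // alpha / (alpha + beta)
  mode?: number; // peak of the curve; only defined when alpha > 1 and beta > 1
  sd: number; // standard deviation; smaller means a narrower curve
}

function betaStats(alpha: number, beta: number): BetaStats {
  const n = alpha + beta;
  const variance = (alpha * beta) / (n * n * (n + 1));
  const mode = alpha > 1 && beta > 1 ? (alpha - 1) / (n - 2) : undefined;
  return { mean: alpha / n, mode, sd: Math.sqrt(variance) };
}

// Same mean, ten times the evidence: the second curve is much narrower.
const wideCurve = betaStats(2, 2);
const narrowCurve = betaStats(20, 20);
```

Both example curves have a mean of 0.5, but the second has a far smaller standard deviation, so its samples rarely stray far from the mean.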

Recent Decisions Table

The last N routing calls, each labeled Exploration (purple) or Exploitation (blue). Exploration steps carry an explorationReason string from the routing engine (visible as a tooltip); exploitation steps took the current highest-expected-reward arm.
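
If you want to cross-check the headline exploration rate against this table, the share of Exploration rows is a reasonable approximation. The `DecisionRow` shape and its `kind` field are assumptions for illustration; only `explorationReason` is named in this document, and the server may compute summary.explorationRate over a different window.

```typescript
type DecisionKind = "exploration" | "exploitation";

// Hypothetical row shape; the real RoutingDecision type may differ.
interface DecisionRow {
  provider: string;
  kind: DecisionKind;
  explorationReason?: string; // present on exploration steps
}

// Fraction of decisions labeled exploration; 0 for an empty list.
function explorationShare(decisions: DecisionRow[]): number {
  if (decisions.length === 0) return 0;
  const explored = decisions.filter((d) => d.kind === "exploration").length;
  return explored / decisions.length;
}
```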

Data Flow

flowchart TD
    A[Session completes] --> B[arm-attribution.ts]
    B --> C[agent_arm_stats\nsuccess_count / failure_count]
    D[PR merged → 30d checkpoint] --> E[code-survival-job.ts]
    E --> F[survival-reward-feeder.ts]
    F --> C
    F --> G[Donmai Redis store\nstore=donmai_redis]
    C --> H[mab-selector.ts\nsamples Beta posteriors]
    H --> I[load-aware-router.ts\napplies load / cost penalties]
    I --> J[Routing decision\nExploration or Exploitation]
    J --> K[RoutingIntelligencePanel\nexpectedReward, survivalReward,\nlearningSignal]

Provider outcomes:

agent_arm_stats - One row per (provider × work-type), updated by arm-attribution.ts after each session completes
Updated in real time, when a session reaches a terminal state
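
A minimal sketch of that update, assuming the row carries the success_count / failure_count columns shown in the flowchart. The `ArmStats` shape and function names are hypothetical, and the real arm-attribution.ts writes to the database rather than returning a new object.

```typescript
// Hypothetical in-memory mirror of one agent_arm_stats row.
interface ArmStats {
  provider: string;
  workType: string;
  successCount: number;
  failureCount: number;
}

// Returns an updated copy with exactly one counter incremented.
function recordSessionOutcome(stats: ArmStats, succeeded: boolean): ArmStats {
  return succeeded
    ? { ...stats, successCount: stats.successCount + 1 }
    : { ...stats, failureCount: stats.failureCount + 1 };
}

// Beta parameters as defined earlier: alpha = successes + 1, beta = failures + 1.
function toBetaParams(stats: ArmStats): { alpha: number; beta: number } {
  return { alpha: stats.successCount + 1, beta: stats.failureCount + 1 };
}
```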

Survival rewards:

survival-reward-feeder.ts - At each 30-day code-survival checkpoint, applies reward to agent_arm_stats AND the Donmai Redis store
survival_routing_rewards table - Ledger row written per arm per checkpoint (keyed by ${provider}:${workType})
survivalReward on ProviderPosterior - Joined from this ledger when the routing-metrics API responds

Thompson Sampling runtime:

mab-selector.ts - Samples from Beta(α, β) posteriors at request time using a Marsaglia-Tsang Gamma sampler
load-aware-router.ts - Applies degraded/unknown/load penalties to sampled values before final selection
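
For readers unfamiliar with the technique, here is a generic sketch of Marsaglia-Tsang Gamma sampling and the Beta draw built on it. It is a textbook rendering under stated assumptions (Box-Muller for the normal draw, `Math.random` as the default uniform source), not the code in mab-selector.ts.

```typescript
// Standard normal via Box-Muller; 1 - rand() keeps the log argument above 0.
function sampleNormal(rand: () => number): number {
  const u1 = 1 - rand();
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Gamma(shape, 1) using the Marsaglia-Tsang squeeze method.
function sampleGamma(shape: number, rand: () => number = Math.random): number {
  if (shape < 1) {
    // Boost for shape < 1: Gamma(a) = Gamma(a + 1) * U^(1/a).
    const u = 1 - rand();
    return sampleGamma(shape + 1, rand) * Math.pow(u, 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  while (true) {
    const x = sampleNormal(rand);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue; // candidate outside the support; redraw
    const u = 1 - rand();
    const x2 = x * x;
    if (u < 1 - 0.0331 * x2 * x2) return d * v; // cheap acceptance test
    if (Math.log(u) < 0.5 * x2 + d * (1 - v + Math.log(v))) return d * v;
  }
}

// Beta(alpha, beta) from two Gamma draws; works for non-integer parameters.
function sampleBeta(alpha: number, beta: number, rand: () => number = Math.random): number {
  const g1 = sampleGamma(alpha, rand);
  const g2 = sampleGamma(beta, rand);
  return g1 / (g1 + g2);
}
```

Unlike a sum-of-exponentials approach, this handles fractional α and β, which matters if rewards in [0, 1] (such as survival rewards) are added to the counts.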

Display & Filtering

The Performance → Routing Intelligence panel shows:

Time window - 7d, 30d, 90d (shorter = more recent, more volatile data)
Work-type filter - Narrow to research, dev, qa, acceptance, etc.
Project filter - Project-scoped posteriors

Panel layout:

Headline cards - Exploration rate + Avg confidence (with hot-path weighting flame badge when survivalRewardCount > 0)
Provider Posteriors table - Sortable by provider, workType, expectedReward, confidence, totalObservations; columns include survivalReward and learning-signal tier
Posterior distribution chart - Overlaid Beta curves from PosteriorDistribution component
Recent Decisions table - Exploration vs. Exploitation labels per call

Screenshot placeholder: A screenshot of the Provider Posteriors table with the flame badge active would help readers identify the hot-path weighting indicator in context.

Interpreting Results

One Provider Dominates (e.g., `expectedReward` = 0.97 vs. 0.92)

System has learned a clear winner; confidence is high (signal: Converging)
Routing will favor the lead provider, but still explore other arms roughly 5% of the time

Action: Normal; system is optimized. Monitor for improvements in the trailing provider.

Arms in "Learning" or "At Prior" Tier

Signal badge shows Learning (2-9 observations) or At prior (1 observation)
Posteriors are still wide; routing decisions are nearly random for that arm

Action: Let it run. An arm needs at least 10 observations to reach Converging. New providers or new work-types always start here.

`survivalReward` Column Populated

Hot-path weighting flame badge is visible in the panel header
Survival rewards are feeding routing decisions; expectedReward reflects code durability, not just session success

Verification:

# Fetch routing metrics and check survivalReward on posteriors
curl -H "Authorization: Bearer $RENSEI_API_KEY" \
  'https://api.rensei.ai/api/public/routing-metrics?workspaceId=ws_123' \
  | jq '.posteriors[] | {provider, workType, expectedReward, survivalReward}'

Exploration Rate Too High or Too Low

High (>15%) → Learning is fast but inefficient; consider reducing if system is stable
Low (<3%) → Missing improvements in underused providers; raise exploration slightly

Configuration: Exploration rate is set per organization in Model Routing settings.

Advanced: Load-Aware Routing

Thompson Sampling on its own ignores provider capacity. load-aware-router.ts adds:

Load penalty - If a provider's current queue is long (from a2a_task_instances via DbLoadProvider), reduce its sampled value
Degraded penalty - If the provider is in degraded mode (rate limited, health-failing), apply an additional penalty
Cost penalty - If a provider is expensive relative to peers, reduce its sample by the cost ratio

The final routing score for each provider is sample_value - load_penalty - degraded_penalty - cost_penalty; the provider with the highest score is selected.
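
The penalty arithmetic can be sketched as follows. The `Candidate` shape and the penalty magnitudes are hypothetical; how load-aware-router.ts actually derives each penalty is not specified here.

```typescript
// Hypothetical per-provider inputs to the final selection step.
interface Candidate {
  provider: string;
  sampleValue: number; // Thompson sample from the arm's Beta posterior
  loadPenalty: number;
  degradedPenalty: number;
  costPenalty: number;
}

// sample_value - load_penalty - degraded_penalty - cost_penalty
function finalScore(c: Candidate): number {
  return c.sampleValue - c.loadPenalty - c.degradedPenalty - c.costPenalty;
}

// Pick the candidate with the highest penalized score.
function pickProvider(candidates: Candidate[]): Candidate {
  if (candidates.length === 0) throw new Error("no candidates");
  return candidates.reduce((best, c) => (finalScore(c) > finalScore(best) ? c : best));
}

// A higher raw sample can still lose once a load penalty is applied.
const chosen = pickProvider([
  { provider: "anthropic", sampleValue: 0.95, loadPenalty: 0.2, degradedPenalty: 0, costPenalty: 0 },
  { provider: "openai", sampleValue: 0.9, loadPenalty: 0, degradedPenalty: 0, costPenalty: 0 },
]);
// chosen.provider is "openai"
```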

Advanced: Skill Matcher

When multiple providers can handle a task, skill-matcher.ts pre-filters to capable providers before Thompson Sampling runs:

Task requires GitHub credentials → only route to providers that have GitHub OAuth wired
Task requires GPU execution → only route to providers with GPU pools enabled
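
A capability pre-filter of this kind might look like the sketch below. The capability strings (`github-oauth`, `gpu`) and the `ProviderProfile` shape are invented for illustration and are not the identifiers used by skill-matcher.ts.

```typescript
// Hypothetical capability profile for a provider.
interface ProviderProfile {
  provider: string;
  capabilities: string[];
}

// Keep only providers that offer every capability the task requires.
function filterCapable(providers: ProviderProfile[], required: string[]): ProviderProfile[] {
  return providers.filter((p) => required.every((cap) => p.capabilities.includes(cap)));
}

const profiles: ProviderProfile[] = [
  { provider: "anthropic", capabilities: ["github-oauth"] },
  { provider: "openai", capabilities: ["github-oauth", "gpu"] },
  { provider: "local-debug", capabilities: [] },
];

// Only providers in this list would go on to be sampled.
const gpuCapable = filterCapable(profiles, ["gpu"]);
```

A task with no requirements passes every provider through, so the filter only narrows the candidate set when a requirement is present.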

API

Endpoint: GET /api/public/routing-metrics?workspaceId=<id>&workType=<type>&limit=<n>

Returns the full RoutingMetrics shape:

curl -H "Authorization: Bearer $RENSEI_API_KEY" \
  'https://api.rensei.ai/api/public/routing-metrics?workspaceId=ws_123'

Response shape (TypeScript types from routing-metrics.ts):

interface RoutingMetrics {
  posteriors: ProviderPosterior[]   // one per (provider × work-type) arm
  recentDecisions: RoutingDecision[]
  summary: {
    totalObservations: number
    routingEnabled: boolean
    explorationRate: number          // 0-1
    avgConfidence: number            // 0-1
    survivalRewardCount?: number     // drives hot-path weighting badge
  }
  timestamp?: string
}
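
For callers working in TypeScript, the request can be wrapped as below. This is a sketch under assumptions: the base URL is taken from the curl example, the helper names are hypothetical, and the response is returned as `unknown` so that callers validate it against RoutingMetrics themselves. It relies on a runtime with a global `fetch` (for example Node 18+ or a browser).

```typescript
interface RoutingMetricsQuery {
  workspaceId: string;
  workType?: string;
  limit?: number;
}

// Builds the GET URL with only the parameters that were provided.
function buildRoutingMetricsUrl(baseUrl: string, query: RoutingMetricsQuery): string {
  const params: string[] = [`workspaceId=${encodeURIComponent(query.workspaceId)}`];
  if (query.workType !== undefined) params.push(`workType=${encodeURIComponent(query.workType)}`);
  if (query.limit !== undefined) params.push(`limit=${query.limit}`);
  return `${baseUrl}/api/public/routing-metrics?${params.join("&")}`;
}

// Sends the request with a bearer token and returns the parsed JSON body.
async function fetchRoutingMetrics(
  baseUrl: string,
  apiKey: string,
  query: RoutingMetricsQuery,
): Promise<unknown> {
  const response = await fetch(buildRoutingMetricsUrl(baseUrl, query), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!response.ok) {
    throw new Error(`routing-metrics request failed with status ${response.status}`);
  }
  return response.json();
}
```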

See Provider Benchmarks for raw cost/success/duration data without the Bayesian framing.

Provider Benchmarks - Raw success/cost/duration metrics
Model Routing - Configure providers, pools, and routing
Code Survival - Foundation for survival rewards
A2A Routing - Routing for agent-to-agent dispatch (similar MAB logic)
