A2A Routing

MAB/Thompson Sampling and load-aware routing.

When a workflow needs to dispatch work to an external A2A agent, it rarely needs to name one explicitly. Instead, the platform's routing layer selects the best available agent using a three-stage pipeline: skill matching narrows the field to capable agents, load-aware constraints exclude or down-weight busy or unhealthy agents, and Thompson Sampling selects among the remaining candidates by drawing from each agent's Beta posterior distribution.

This pipeline is self-improving: every session outcome feeds back into the Beta parameters, so the system learns from experience without any manual tuning.

Architecture overview

Stage 1: Skill matching

matchAgentsForWork scores every registered agent against the work request. The match considers:

  • Required skills - agents that lack all required skill IDs are excluded entirely
  • Work type - agents whose skill tags include the requested work type score higher
  • Description similarity - when a description is supplied, tag overlap is used as an additional signal

The output is a ranked list of SkillMatchCandidate objects, each with an agentCardId and a capabilityScore in [0, 1].
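
The three signals above can be sketched as a small scoring function. This is a minimal illustration, not the actual matchAgentsForWork implementation: the AgentCard and WorkRequest shapes, the function name matchAgentsSketch, and the 0.4 / 0.4 / 0.2 weights are all assumptions made for the example. The exclusion rule follows the literal wording of the first bullet (an agent is dropped only when it holds none of the required skill IDs); a stricter every-skill check would be a one-line change.

```typescript
// Hypothetical shapes: only agentCardId and capabilityScore are named in the docs.
interface AgentCard {
  agentCardId: string
  skillIds: string[]
  skillTags: string[]
}

interface WorkRequest {
  workType: string
  requiredSkills?: string[]
  description?: string
}

interface SkillMatchCandidate {
  agentCardId: string
  capabilityScore: number // in [0, 1]
}

function matchAgentsSketch(agents: AgentCard[], request: WorkRequest): SkillMatchCandidate[] {
  const required = request.requiredSkills ?? []
  const words = new Set(
    (request.description ?? '')
      .toLowerCase()
      .split(/[^a-z0-9-]+/)
      .filter((w) => w.length > 0),
  )
  const candidates: SkillMatchCandidate[] = []
  for (const agent of agents) {
    // Assumed exclusion rule: drop agents holding none of the required skill IDs.
    if (required.length > 0 && !required.some((id) => agent.skillIds.includes(id))) continue

    const tags = agent.skillTags.map((t) => t.toLowerCase())
    const workTypeHit = tags.includes(request.workType.toLowerCase()) ? 1 : 0
    // Share of the agent's tags that also appear in the description.
    const overlap = tags.length === 0 ? 0 : tags.filter((t) => words.has(t)).length / tags.length

    // Illustrative weights only; they keep the score inside [0.4, 1.0].
    const capabilityScore = 0.4 + 0.4 * workTypeHit + 0.2 * overlap
    candidates.push({ agentCardId: agent.agentCardId, capabilityScore })
  }
  // Highest capability first.
  return candidates.sort((a, b) => b.capabilityScore - a.capabilityScore)
}
```

The sorted array plays the role of the ranked SkillMatchCandidate list that Stage 2 consumes.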

Stage 2: Load-aware constraints

routeWorkToAgent (in load-aware-router.ts) applies health and load constraints to the skill-matched candidates before passing them to the MAB selector.

Health constraints

| Health status | Constraint | Default factor |
| --- | --- | --- |
| healthy | No penalty | 1.0 |
| degraded | Penalty multiplier applied to the sampled value | 0.5 |
| unknown | Smaller penalty multiplier | 0.8 |
| unreachable | Excluded entirely - never routed to | - |

Load constraints

Active task counts are read from the a2a_task_instances table via DbLoadProvider, so the counts are cross-process safe and production does not rely on in-memory counters.

| Condition | Constraint | Default |
| --- | --- | --- |
| Tasks below soft cap | No penalty | soft cap = 5 |
| Tasks at or above soft cap | Penalty multiplier | factor = 0.5, soft cap = 5 |
| Tasks at or above hard cap | Excluded entirely | hard cap = 10 |
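
The health and load rules above can be sketched as one function that either excludes a candidate or returns the multiplier to apply to its sampled value. This is a sketch under stated assumptions: degradedPenalty, loadSoftCap and loadHardCap are the documented config keys, while unknownPenalty, loadPenalty, the function name applyConstraints and the reason strings are hypothetical. It also assumes that health and load penalties combine by multiplication, which the docs do not state.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unknown' | 'unreachable'

interface ConstraintConfig {
  degradedPenalty: number
  unknownPenalty: number // hypothetical key name
  loadPenalty: number // hypothetical key name
  loadSoftCap: number
  loadHardCap: number
}

// Defaults taken from the two tables above.
const DEFAULT_CONSTRAINTS: ConstraintConfig = {
  degradedPenalty: 0.5,
  unknownPenalty: 0.8,
  loadPenalty: 0.5,
  loadSoftCap: 5,
  loadHardCap: 10,
}

type ConstraintResult =
  | { excluded: true; reason: string }
  | { excluded: false; multiplier: number }

function applyConstraints(
  health: HealthStatus,
  activeTasks: number,
  overrides: Partial<ConstraintConfig> = {},
): ConstraintResult {
  const cfg = { ...DEFAULT_CONSTRAINTS, ...overrides }

  // Hard exclusions come first: these agents never reach the MAB selector.
  if (health === 'unreachable') return { excluded: true, reason: 'unreachable' }
  if (activeTasks >= cfg.loadHardCap) return { excluded: true, reason: 'hard_cap' }

  // Assumption: health and load penalties multiply together.
  let multiplier = 1.0
  if (health === 'degraded') multiplier *= cfg.degradedPenalty
  if (health === 'unknown') multiplier *= cfg.unknownPenalty
  if (activeTasks >= cfg.loadSoftCap) multiplier *= cfg.loadPenalty
  return { excluded: false, multiplier }
}
```

Under these assumptions a degraded agent sitting at the soft cap keeps a multiplier of 0.25, so it can still win a draw but only when its sample is far ahead of the alternatives.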

Cost-sensitive routing override

When costSensitive: true is passed to the router, the router skips MAB sampling and selects the cheapest capable agent by agent_cards.cost_per_task. The one exception is a tie: when several agents share the lowest cost, MAB sampling breaks the tie among them.

Cost is read first from the agent_cards.cost_per_task numeric column (added in migration 0048), with fallback to the legacy card.metadata.costPerTask JSONB path.
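
A minimal sketch of the cost lookup and cheapest-agent filter follows. The CostedCandidate shape and the function names are hypothetical, and treating an agent with no cost in either location as infinitely expensive is an assumption made for the example, not documented behaviour.

```typescript
interface CostedCandidate {
  agentCardId: string
  costPerTask: number | null // agent_cards.cost_per_task column
  metadata?: { costPerTask?: number } // legacy card.metadata.costPerTask path
}

function resolveCost(candidate: CostedCandidate): number {
  // Column first, then the legacy JSONB path.
  // Assumption: an agent with no known cost sorts last.
  return candidate.costPerTask ?? candidate.metadata?.costPerTask ?? Number.POSITIVE_INFINITY
}

function cheapestCandidates(candidates: CostedCandidate[]): CostedCandidate[] {
  if (candidates.length === 0) return []
  const lowest = Math.min(...candidates.map(resolveCost))
  // More than one survivor means a tie that the MAB selector would break.
  return candidates.filter((c) => resolveCost(c) === lowest)
}
```

When cheapestCandidates returns a single agent it can be selected directly; a longer result is the tied set that would be handed to Thompson Sampling.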

Constraint configuration

All thresholds can be overridden per routing call via ConstraintConfig:

import { routeWorkToAgent } from '@/lib/a2a/load-aware-router'

const result = await routeWorkToAgent({
  orgId,                    // organisation ID from the calling context
  workType: 'qa',
  description: 'Run integration tests',
  options: {
    constraints: {
      degradedPenalty: 0.3,   // more aggressive penalty for degraded agents
      loadSoftCap: 3,         // tighter load threshold
      loadHardCap: 6,
    },
  },
})

Stage 3: Thompson Sampling (MAB selector)

Each agent is modelled as a bandit arm with a Beta(α, β) distribution tracking its empirical success rate. At selection time, the platform draws one sample from each arm's distribution; the agent with the highest sample wins.

Beta distribution

Beta(α, β) has mean α/(α+β). A new agent starts with the uninformative prior Beta(1, 1), which is uniform over [0, 1]. As outcomes accumulate, the distribution concentrates around the agent's true success rate:

  • Successful session (reward = 1): Δα = +1, Δβ = 0 → mean shifts toward 1
  • Failed session (reward = 0): Δα = 0, Δβ = +1 → mean shifts toward 0
  • Crash (legacy discrete path): Δβ = +3 (the CRASH_PENALTY)

The continuous reward path (reward ∈ [0, 1]) is preferred and matches the donmai Redis store's math: Δα = reward, Δβ = 1 − reward. Fractional rewards from code-survival blends use the optional weight parameter (λ < 1) so that slow survival signals are blended into, rather than overwhelm, the fast per-session signals.
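
The update rules above can be sketched as pure functions over an arm's (α, β) pair. The ArmStats shape and the function names are hypothetical. The sketch assumes the weight λ scales both deltas (Δα = λ·reward, Δβ = λ·(1 − reward)) and that a crash leaves α unchanged; neither detail is spelled out in the text.

```typescript
interface ArmStats {
  alpha: number
  beta: number
}

const CRASH_PENALTY = 3 // value documented for the legacy discrete path

function updateArm(arm: ArmStats, reward: number, weight = 1): ArmStats {
  if (!(reward >= 0 && reward <= 1)) throw new RangeError('reward must be in [0, 1]')
  if (!(weight > 0 && weight <= 1)) throw new RangeError('weight must be in (0, 1]')
  // Assumption: the weight scales both deltas, so a weighted update adds
  // `weight` pseudo-observations in total instead of one.
  return {
    alpha: arm.alpha + weight * reward,
    beta: arm.beta + weight * (1 - reward),
  }
}

function applyCrash(arm: ArmStats): ArmStats {
  // Assumption: a crash only touches beta.
  return { alpha: arm.alpha, beta: arm.beta + CRASH_PENALTY }
}

function posteriorMean(arm: ArmStats): number {
  return arm.alpha / (arm.alpha + arm.beta)
}
```

Starting from Beta(1, 1), one success gives Beta(2, 1) with mean 2/3, and a following failure gives Beta(2, 2) with mean 1/2. A survival reward of 0.8 applied with weight 0.5 then adds only 0.4 to α and 0.1 to β, which is how a slow signal nudges the posterior without dominating it.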

Gamma sampler

Beta sampling uses Marsaglia and Tsang's Gamma sampler (doi:10.1145/358407.358414): X = G₁/(G₁+G₂) where G₁ ~ Gamma(α, 1) and G₂ ~ Gamma(β, 1). Box-Muller is used for the intermediate normal draw. The sampler accepts a custom rng function; injecting a seeded one makes every draw reproducible, which is useful for testing.
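
For readers who want to see the technique end to end, here is a self-contained sketch of a Beta sampler built from Marsaglia-Tsang Gamma draws with a Box-Muller normal and an injectable rng. It is an independent illustration, not the platform's source: all function names are hypothetical, and mulberry32 is simply one small seedable generator chosen for the example. Shapes below 1 use the standard boost Gamma(a) = Gamma(a + 1) · U^(1/a).

```typescript
type Rng = () => number // returns a uniform value in [0, 1)

// Small seedable PRNG (mulberry32) so that draws can be reproduced in tests.
function mulberry32(seed: number): Rng {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Box-Muller transform: one standard normal draw from two uniforms.
function sampleStandardNormal(rng: Rng): number {
  const u1 = 1 - rng() // shift [0, 1) to (0, 1] so Math.log stays finite
  const u2 = rng()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

// Marsaglia-Tsang rejection sampler for Gamma(shape, 1).
function sampleGamma(shape: number, rng: Rng): number {
  if (!(shape > 0)) throw new RangeError('shape must be positive')
  if (shape < 1) {
    // Boost: Gamma(a) = Gamma(a + 1) * U^(1/a) for a < 1.
    const u = 1 - rng()
    return sampleGamma(shape + 1, rng) * Math.pow(u, 1 / shape)
  }
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    let x = 0
    let v = 0
    do {
      x = sampleStandardNormal(rng)
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = 1 - rng()
    // Accept when log(u) falls under the squeeze bound.
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v
  }
}

// Beta(alpha, beta) as a ratio of two independent Gamma draws.
function sampleBeta(alpha: number, beta: number, rng: Rng = Math.random): number {
  const g1 = sampleGamma(alpha, rng)
  const g2 = sampleGamma(beta, rng)
  return g1 / (g1 + g2)
}
```

Calling sampleBeta(8, 2, mulberry32(42)) twice with freshly seeded generators returns the same value both times, which is the property a test suite relies on. Omitting the rng argument falls back to Math.random.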

Per-work-type arm stats

Each agent has separate Beta distributions per work type, stored as (agentCardId, workType) rows in agent_arm_stats. When workType is supplied:

  1. Per-work-type rows are looked up first.
  2. For agents with no per-work-type row, the global (workType=NULL) row is used as a fallback.
  3. When workType is supplied, recordOutcome updates both the per-work-type row AND the global row so global stats accumulate evidence across all work types.

This gives the composition (workType → capability filter → per-(agent, workType) arm) two effective levels of routing intelligence in a single call.
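
The lookup and write rules in the numbered list can be sketched with an in-memory stand-in for the agent_arm_stats table. The ArmStore class, its method names and the JSON-encoded key are hypothetical; the real storage is a database table keyed by (agentCardId, workType), with workType NULL for the global row.

```typescript
interface ArmStats {
  alpha: number
  beta: number
}

const UNIFORM_PRIOR: ArmStats = { alpha: 1, beta: 1 }

// In-memory stand-in for agent_arm_stats; null workType marks the global row.
class ArmStore {
  private rows = new Map<string, ArmStats>()

  private key(agentCardId: string, workType: string | null): string {
    return JSON.stringify([agentCardId, workType])
  }

  // Steps 1 and 2: per-work-type row first, then the global row, then Beta(1, 1).
  lookup(agentCardId: string, workType?: string): ArmStats {
    const specific =
      workType !== undefined ? this.rows.get(this.key(agentCardId, workType)) : undefined
    const found = specific ?? this.rows.get(this.key(agentCardId, null)) ?? UNIFORM_PRIOR
    return { ...found }
  }

  // Step 3: a work-typed outcome updates both the typed row and the global row.
  record(agentCardId: string, reward: number, workType?: string): void {
    const targets: Array<string | null> = workType !== undefined ? [workType, null] : [null]
    for (const target of targets) {
      const k = this.key(agentCardId, target)
      const prev = this.rows.get(k) ?? UNIFORM_PRIOR
      this.rows.set(k, { alpha: prev.alpha + reward, beta: prev.beta + (1 - reward) })
    }
  }
}
```

With this layout an agent that has only ever done 'qa' work still presents informed global stats when it is first considered for 'development', instead of restarting from the uniform prior.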

Cold-start priors

When no arm row exists at all (a truly new agent), the default prior is Beta(1, 1), the maximum-uncertainty starting point. Opt in to proficiency-derived priors by passing useProficiencyPriors: true; this seeds cold-start arms with a Beta prior derived from the agent's session-diagnostic history.

Single-candidate fast path

When skill matching and constraints leave only one candidate, the MAB sampling step is skipped. The single candidate is selected with sampledValue = 0.5 (a sentinel indicating no sampling was performed).
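
The selection step, including the single-candidate fast path, can be sketched as follows. The Candidate and Selection shapes, the sampled flag and the function name are hypothetical. The Beta draw is injected as a function so the sketch stays independent of any particular sampler, and the multiplier stands for the combined Stage 2 penalty applied to the sampled value.

```typescript
interface Candidate {
  agentCardId: string
  alpha: number
  beta: number
  multiplier: number // Stage 2 penalty factor, 1.0 when unpenalised
}

interface Selection {
  agentCardId: string
  sampledValue: number
  sampled: boolean // hypothetical flag: false on the fast path
}

type BetaDraw = (alpha: number, beta: number) => number

const NO_SAMPLING_SENTINEL = 0.5 // documented sentinel for the fast path

function selectAgent(candidates: Candidate[], draw: BetaDraw): Selection | null {
  const [first] = candidates
  if (first === undefined) return null // nothing left to route to

  // Fast path: a lone candidate wins without drawing from its posterior.
  if (candidates.length === 1) {
    return { agentCardId: first.agentCardId, sampledValue: NO_SAMPLING_SENTINEL, sampled: false }
  }

  // Thompson Sampling: one draw per arm, penalised, highest value wins.
  let best: Selection | null = null
  for (const c of candidates) {
    const value = draw(c.alpha, c.beta) * c.multiplier
    if (best === null || value > best.sampledValue) {
      best = { agentCardId: c.agentCardId, sampledValue: value, sampled: true }
    }
  }
  return best
}
```

Passing a deterministic draw function, for example one that returns the posterior mean α/(α+β), turns the selector into a greedy picker, which is a convenient way to unit-test the penalty logic in isolation.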

Feedback wiring: closing the learning loop

Session outcome → arm update

arm-attribution.ts bridges finished sessions to their MAB arms. When a session completes, it maps the session's agentCardId to its outcome and calls recordOutcome, updating the agent_arm_stats row with the appropriate Δα and Δβ.

Code-survival → survival reward blend

survival-reward-feeder.ts runs 30-day checkpoints: it checks whether agent-authored code has survived in the repository and applies a blended reward to the arm via recordOutcome with weight < 1. The blend uses a hot_weighted_rate_pct field from the code-survival scan results. This write also propagates to the donmai Redis MAB store (store='donmai_redis'), keeping the in-process and Redis posteriors in sync.

Routing audit log

Every routing decision is written to the routing_decisions table (non-fatal - a write failure never blocks routing). The record includes:

  • Selected agentCardId (or null when all candidates were constrained out)
  • Full decisionMetadata: skill-matched candidates, excluded agents (with reasons), penalized agents, MAB sampled values and arm stats, active constraint flags, and fallback status

The Routing Intelligence panel reads this table to show Thompson Sampling posteriors, exploration rate, average confidence, and recent routing decisions.

Queued fallback

When all skill-matched candidates are excluded by constraints (all unreachable or over hard cap), the router returns { agentCardId: null, fallback: 'queued' }. The caller is responsible for deciding what to do - typically re-queuing the work for later dispatch.

TypeScript API

import { routeWorkToAgent } from '@/lib/a2a/load-aware-router'
import { recordOutcome } from '@/lib/a2a/mab-selector'

// Route work - all three stages run automatically
const result = await routeWorkToAgent({
  orgId: 'org_123',
  workType: 'development',
  description: 'Implement the new user auth flow',
  requiredSkills: ['typescript', 'next-auth'],
})

if (result.agentCardId === null) {
  // All candidates constrained out - handle fallback
  console.log('Fallback:', result.decisionMetadata.fallback)
} else {
  console.log('Selected:', result.agentCardId)
  // ... dispatch to the agent ...

  // Record outcome when the session completes
  await recordOutcome({
    agentCardId: result.agentCardId,
    orgId: 'org_123',
    workType: 'development',
    reward: 0.95, // continuous reward in [0, 1]
  })
}

Viewing routing decisions

The Routing Intelligence panel in your project's Performance section shows:

  • Per-agent Beta posteriors visualised as probability density charts
  • Exploration rate (how often the MAB selected a non-exploitation arm)
  • Average confidence score across recent routing decisions
  • Per-provider / per-work-type success rate table
  • survivalReward field on each ProviderPosterior showing the code-survival blend contribution
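
As an illustration of the exploration-rate metric, the sketch below computes it from logged decisions. It assumes that a "non-exploitation arm" means any arm other than the one with the highest posterior mean at decision time, and that decisions with no selection or fewer than two arms are ignored because no real choice was made. The LoggedDecision shape and function names are hypothetical and do not describe the actual routing_decisions schema.

```typescript
interface LoggedArm {
  agentCardId: string
  alpha: number
  beta: number
}

interface LoggedDecision {
  selectedAgentCardId: string | null
  arms: LoggedArm[] // arm stats recorded at decision time
}

function armMean(arm: LoggedArm): number {
  return arm.alpha / (arm.alpha + arm.beta)
}

function explorationRate(decisions: LoggedDecision[]): number {
  let contested = 0
  let explored = 0
  for (const d of decisions) {
    // Skip queued fallbacks and fast-path decisions: nothing was sampled.
    if (d.selectedAgentCardId === null || d.arms.length < 2) continue
    // The greedy arm is the one with the highest posterior mean.
    const greedy = d.arms.reduce((best, arm) => (armMean(arm) > armMean(best) ? arm : best))
    contested += 1
    if (greedy.agentCardId !== d.selectedAgentCardId) explored += 1
  }
  return contested === 0 ? 0 : explored / contested
}
```

A rate near zero suggests the posteriors have concentrated and the router is mostly exploiting; a persistently high rate suggests the arms are still close or have little evidence behind them.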

See Routing Intelligence for full documentation.
