A2A Routing
Multi-armed bandit (MAB) routing with Thompson Sampling and load-aware constraints.
When a workflow needs to dispatch work to an external A2A agent, it rarely needs to name one explicitly. Instead, the platform's routing layer selects the best available agent using a three-stage pipeline: skill matching narrows the field to capable agents, load-aware constraints exclude or down-weight busy or unhealthy agents, and Thompson Sampling selects among the remaining candidates by drawing from each agent's Beta posterior distribution.
This pipeline is self-improving: every session outcome feeds back into the Beta parameters, so the system learns from experience without any manual tuning.
Architecture overview
Stage 1: Skill matching
matchAgentsForWork scores every registered agent against the work request. The match considers:
- Required skills - agents that lack all required skill IDs are excluded entirely
- Work type - agents whose skill tags include the requested work type score higher
- Description similarity - when a description is supplied, tag overlap is used as an additional signal
The output is a ranked list of SkillMatchCandidate objects, each with an agentCardId and a capabilityScore in [0, 1].
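The scoring described above could be sketched roughly as follows. This is an illustration, not the actual matchAgentsForWork: the `AgentCardSketch` and `WorkRequestSketch` shapes, the function name `matchAgentsSketch`, and the 0.5 / 0.3 / 0.2 weights are all assumptions. It also reads the exclusion rule literally, dropping an agent only when it has none of the required skills.

```typescript
// Hypothetical shapes: the real agent card and request types may differ.
interface AgentCardSketch {
  id: string
  skillIds: string[]
  skillTags: string[]
}

interface WorkRequestSketch {
  workType: string
  requiredSkills?: string[]
  description?: string
}

interface SkillMatchCandidate {
  agentCardId: string
  capabilityScore: number // in [0, 1]
}

function matchAgentsSketch(
  agents: AgentCardSketch[],
  req: WorkRequestSketch,
): SkillMatchCandidate[] {
  const required = req.requiredSkills ?? []
  const words = (req.description ?? '').toLowerCase().split(/\W+/).filter(Boolean)
  const candidates: SkillMatchCandidate[] = []

  for (const agent of agents) {
    const covered = required.filter((s) => agent.skillIds.includes(s)).length
    // Assumption: exclude only when none of the required skills is present.
    if (required.length > 0 && covered === 0) continue

    const skillScore = required.length > 0 ? covered / required.length : 1
    const typeScore = agent.skillTags.includes(req.workType) ? 1 : 0
    const tagHits = agent.skillTags.filter((t) => words.includes(t.toLowerCase())).length
    const overlapScore = agent.skillTags.length > 0 ? tagHits / agent.skillTags.length : 0

    // Assumed weights; they sum to 1 so the score stays in [0, 1].
    const capabilityScore = 0.5 * skillScore + 0.3 * typeScore + 0.2 * overlapScore
    candidates.push({ agentCardId: agent.id, capabilityScore })
  }

  // Highest capability first.
  return candidates.sort((a, b) => b.capabilityScore - a.capabilityScore)
}
```

Because the weights sum to 1 and each component is in [0, 1], every capabilityScore stays within the documented range.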
Stage 2: Load-aware constraints
routeWorkToAgent (in load-aware-router.ts) applies health and load constraints to the skill-matched candidates before passing them to the MAB selector.
Health constraints
| Health status | Constraint | Default factor |
|---|---|---|
| healthy | No penalty | 1.0 |
| degraded | Penalty multiplier applied to the sampled value | 0.5 |
| unknown | Smaller penalty multiplier | 0.8 |
| unreachable | Excluded entirely - never routed to | - |
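As a rough sketch, the health table can be read as a function from status to a multiplier, with exclusion modelled as `null`. The function name `healthFactor` and the `unknownPenalty` parameter are hypothetical; only `degradedPenalty` appears in the constraint configuration documented on this page.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unknown' | 'unreachable'

// Returns the multiplier applied to the sampled value,
// or null when the agent must be excluded from routing.
function healthFactor(
  status: HealthStatus,
  degradedPenalty = 0.5,
  unknownPenalty = 0.8, // hypothetical parameter name
): number | null {
  switch (status) {
    case 'healthy':
      return 1.0
    case 'degraded':
      return degradedPenalty
    case 'unknown':
      return unknownPenalty
    case 'unreachable':
      return null
  }
}
```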
Load constraints
Active task counts are read from the a2a_task_instances table via DbLoadProvider - cross-process safe, no in-memory counters in production.
| Condition | Constraint | Default |
|---|---|---|
| Tasks below soft cap | No penalty | soft cap = 5 |
| Tasks at or above soft cap | Penalty multiplier | factor = 0.5, soft cap = 5 |
| Tasks at or above hard cap | Excluded entirely | hard cap = 10 |
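The load table follows the same pattern. The sketch below assumes the hard cap is checked before the soft cap; `loadFactor` and `loadPenalty` are hypothetical names, while `loadSoftCap` and `loadHardCap` mirror the constraint options documented on this page.

```typescript
// Returns the multiplier for the sampled value,
// or null when the agent is at or above the hard cap and must be excluded.
function loadFactor(
  activeTasks: number,
  loadSoftCap = 5,
  loadHardCap = 10,
  loadPenalty = 0.5, // hypothetical parameter name
): number | null {
  if (activeTasks >= loadHardCap) return null
  if (activeTasks >= loadSoftCap) return loadPenalty
  return 1.0
}
```

With the defaults, an agent running 4 tasks is unpenalised, one running 5 to 9 tasks is penalised, and one running 10 or more is excluded.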
Cost-sensitive routing override
When costSensitive: true is passed to the router, the router selects the cheapest capable agent by agent_cards.cost_per_task instead of sampling across all candidates. MAB sampling is used only to break ties when several agents share the lowest cost.
Cost is read first from the agent_cards.cost_per_task numeric column (added in migration 0048), with fallback to the legacy card.metadata.costPerTask JSONB path.
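A minimal sketch of the cost lookup and cheapest-agent selection, assuming a simplified card shape (`CostCardSketch`) and hypothetical helper names. It also assumes that agents with no known cost are skipped, which this page does not state.

```typescript
// Hypothetical, simplified view of an agent card row.
interface CostCardSketch {
  id: string
  cost_per_task: number | null
  metadata?: { costPerTask?: number }
}

// Numeric column first, legacy JSONB path as fallback.
function resolveCost(card: CostCardSketch): number | null {
  if (card.cost_per_task !== null) return card.cost_per_task
  return card.metadata?.costPerTask ?? null
}

// Returns every card tied for the lowest cost; a tie would then go to the MAB.
function cheapestCandidates(cards: CostCardSketch[]): CostCardSketch[] {
  let best = Infinity
  let tied: CostCardSketch[] = []
  for (const card of cards) {
    const cost = resolveCost(card)
    if (cost === null) continue // assumption: unknown cost is skipped
    if (cost < best) {
      best = cost
      tied = [card]
    } else if (cost === best) {
      tied.push(card)
    }
  }
  return tied
}
```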
Constraint configuration
All thresholds can be overridden per routing call via ConstraintConfig:
```typescript
const result = await routeWorkToAgent({
  orgId,
  workType: 'qa',
  description: 'Run integration tests',
  options: {
    constraints: {
      degradedPenalty: 0.3, // more aggressive penalty for degraded agents
      loadSoftCap: 3, // tighter load threshold
      loadHardCap: 6,
    },
  },
})
```
Stage 3: Thompson Sampling (MAB selector)
Each agent is modelled as a bandit arm with a Beta(α, β) distribution tracking its empirical success rate. At selection time, the platform draws one sample from each arm's distribution; the agent with the highest sample wins.
Beta distribution
Beta(α, β) has mean α/(α+β). A new agent starts with the uninformative prior Beta(1, 1), which is uniform over [0, 1]. As outcomes accumulate, the distribution concentrates around the agent's true success rate:
- Successful session (reward = 1): Δα = +1, Δβ = 0 → mean shifts toward 1
- Failed session (reward = 0): Δα = 0, Δβ = +1 → mean shifts toward 0
- Crash (legacy discrete path): Δα = 0, Δβ = +3 (the CRASH_PENALTY)
The continuous reward path (reward ∈ [0, 1]) is preferred and matches the donmai Redis store's math: Δα = reward, Δβ = 1 − reward. Fractional rewards from code-survival blends use the optional weight parameter (λ < 1) so slow survival signals blend onto - rather than overwhelm - fast per-session signals.
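The update rules above can be sketched as pure functions. This is an illustration under assumptions: the names `ArmStats`, `applyReward`, `applyCrash`, and `posteriorMean` are hypothetical, and the weight λ is assumed to scale both increments (Δα = λ · reward, Δβ = λ · (1 − reward)), which this page does not spell out.

```typescript
interface ArmStats {
  alpha: number
  beta: number
}

const CRASH_PENALTY = 3

// Continuous reward path: Δα = weight * reward, Δβ = weight * (1 - reward).
function applyReward(arm: ArmStats, reward: number, weight = 1): ArmStats {
  const r = Math.min(1, Math.max(0, reward)) // defensive clamp (an assumption)
  return {
    alpha: arm.alpha + weight * r,
    beta: arm.beta + weight * (1 - r),
  }
}

// Legacy discrete path: a crash adds CRASH_PENALTY to beta.
function applyCrash(arm: ArmStats): ArmStats {
  return { alpha: arm.alpha, beta: arm.beta + CRASH_PENALTY }
}

function posteriorMean(arm: ArmStats): number {
  return arm.alpha / (arm.alpha + arm.beta)
}
```

As a worked example, starting from Beta(1, 1), eight successes and two failures give Beta(9, 3), whose mean is 9/12 = 0.75.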
Gamma sampler
Beta sampling uses Marsaglia and Tsang's Gamma sampler (doi:10.1145/358407.358414): X = G₁/(G₁+G₂) where G₁ ~ Gamma(α, 1) and G₂ ~ Gamma(β, 1). Box-Muller is used for the intermediate normal draw. The sampler is deterministic when you inject a custom rng function (useful for testing).
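For orientation, here is a compact sketch of that sampling scheme with an injectable rng. It is not the platform's code: the function names are hypothetical, `rng` is assumed to return values in [0, 1), and the shape < 1 branch uses the standard boosting identity Gamma(a) = Gamma(a + 1) · U^(1/a). The `mulberry32` helper is a small seedable generator included only to demonstrate deterministic draws.

```typescript
type Rng = () => number // assumed to return values in [0, 1)

// Box-Muller transform; 1 - rng() keeps the log argument in (0, 1].
function sampleNormal(rng: Rng): number {
  const u1 = 1 - rng()
  const u2 = rng()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

// Marsaglia-Tsang squeeze method for Gamma(shape, 1).
function sampleGamma(shape: number, rng: Rng): number {
  if (shape < 1) {
    // Boost small shapes: Gamma(a) = Gamma(a + 1) * U^(1/a)
    const u = 1 - rng()
    return sampleGamma(shape + 1, rng) * Math.pow(u, 1 / shape)
  }
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    const x = sampleNormal(rng)
    const v = Math.pow(1 + c * x, 3)
    if (v <= 0) continue
    const u = 1 - rng()
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v
  }
}

// X = G1 / (G1 + G2) with G1 ~ Gamma(alpha, 1), G2 ~ Gamma(beta, 1).
function sampleBeta(alpha: number, beta: number, rng: Rng = Math.random): number {
  const g1 = sampleGamma(alpha, rng)
  const g2 = sampleGamma(beta, rng)
  return g1 / (g1 + g2)
}

// Small seedable generator, used here only to make draws reproducible.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}
```

Two generators built from the same seed produce the same sequence of draws, which is what makes sampler-dependent tests repeatable.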
Per-work-type arm stats
Each agent has separate Beta distributions per work type, stored as (agentCardId, workType) rows in agent_arm_stats. When workType is supplied:
- Per-work-type rows are looked up first.
- For agents with no per-work-type row, the global (workType=NULL) row is used as a fallback.
- When workType is supplied, recordOutcome updates both the per-work-type row AND the global row so global stats accumulate evidence across all work types.
This gives the composition (workType → capability filter → per-(agent, workType) arm) two effective levels of routing intelligence in a single call.
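The lookup and dual-write behaviour can be illustrated with an in-memory stand-in for agent_arm_stats. The class `ArmStoreSketch` and its method names are hypothetical; the real store is a database table, and a `null` work type here plays the role of the global (workType=NULL) row.

```typescript
interface ArmRow {
  alpha: number
  beta: number
}

class ArmStoreSketch {
  private rows = new Map<string, ArmRow>()

  private key(agentCardId: string, workType: string | null): string {
    return JSON.stringify([agentCardId, workType])
  }

  // Per-work-type row first, then the global row, then the Beta(1, 1) prior.
  lookup(agentCardId: string, workType?: string): ArmRow {
    const specific = workType
      ? this.rows.get(this.key(agentCardId, workType))
      : undefined
    return specific ?? this.rows.get(this.key(agentCardId, null)) ?? { alpha: 1, beta: 1 }
  }

  // With a workType, both the per-work-type row and the global row are updated.
  record(agentCardId: string, reward: number, workType?: string): void {
    const targets: Array<string | null> = workType ? [workType, null] : [null]
    for (const wt of targets) {
      const k = this.key(agentCardId, wt)
      const row = this.rows.get(k) ?? { alpha: 1, beta: 1 }
      this.rows.set(k, { alpha: row.alpha + reward, beta: row.beta + (1 - reward) })
    }
  }
}
```

After one successful 'qa' outcome for an agent, a lookup for 'qa' and a lookup for an unseen work type both return Beta(2, 1): the first from the per-work-type row, the second from the global fallback.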
Cold-start priors
When no arm row exists at all (a truly new agent), the default prior is Beta(1, 1), the maximum-uncertainty starting point. To opt in to proficiency-derived priors, pass useProficiencyPriors: true; this seeds cold-start arms with a Beta prior derived from the agent's session-diagnostic history.
Single-candidate fast path
When skill matching and constraints leave only one candidate, the MAB sampling step is skipped. The single candidate is selected with sampledValue = 0.5 (a sentinel indicating no sampling was performed).
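Putting the selection step and the fast path together, a selector might look like the sketch below. The names `CandidateArm`, `Selection`, and `selectArm` are hypothetical, the Beta sampler is injected so the example stays self-contained, and `penalty` stands for whatever combined constraint multiplier applies to the agent.

```typescript
interface CandidateArm {
  agentCardId: string
  alpha: number
  beta: number
  penalty: number // multiplier from health and load constraints; 1.0 means none
}

interface Selection {
  agentCardId: string
  sampledValue: number
}

type BetaSampler = (alpha: number, beta: number) => number

function selectArm(candidates: CandidateArm[], sampleBeta: BetaSampler): Selection | null {
  const first = candidates[0]
  if (first === undefined) return null // nothing left to route to

  // Fast path: a single candidate is chosen without sampling.
  if (candidates.length === 1) {
    return { agentCardId: first.agentCardId, sampledValue: 0.5 }
  }

  // Thompson Sampling: one draw per arm, highest penalised draw wins.
  let best: Selection | null = null
  for (const c of candidates) {
    const sampledValue = sampleBeta(c.alpha, c.beta) * c.penalty
    if (best === null || sampledValue > best.sampledValue) {
      best = { agentCardId: c.agentCardId, sampledValue }
    }
  }
  return best
}
```

In tests, passing a stub sampler such as `(a, b) => a / (a + b)` makes the selection deterministic.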
Feedback wiring: closing the learning loop
Session outcome → arm update
arm-attribution.ts bridges finished sessions to their MAB arms. When a session completes, it maps the session's agentCardId to its outcome and calls recordOutcome, updating the agent_arm_stats row with the appropriate Δα and Δβ.
Code-survival → survival reward blend
survival-reward-feeder.ts runs 30-day checkpoints: it checks whether agent-authored code has survived in the repository and applies a blended reward to the arm via recordOutcome with weight < 1. The blend uses a hot_weighted_rate_pct field from the code-survival scan results. This write also propagates to the donmai Redis MAB store (store='donmai_redis'), keeping the in-process and Redis posteriors in sync.
Routing audit log
Every routing decision is written to the routing_decisions table (non-fatal - a write failure never blocks routing). The record includes:
- Selected agentCardId (or null when all candidates were constrained out)
- Full decisionMetadata: skill-matched candidates, excluded agents (with reasons), penalized agents, MAB sampled values and arm stats, active constraint flags, and fallback status
The Routing Intelligence panel reads this table to show Thompson Sampling posteriors, exploration rate, average confidence, and recent routing decisions.
Queued fallback
When all skill-matched candidates are excluded by constraints (all unreachable or over hard cap), the router returns { agentCardId: null, fallback: 'queued' }. The caller is responsible for deciding what to do - typically re-queuing the work for later dispatch.
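One way a caller might handle the queued fallback is sketched below. Everything here other than the `agentCardId` and `fallback` fields is an assumption: the `PendingWork` shape, the in-memory retry queue, and the `handleRoutingResult` helper are illustrative only.

```typescript
// Simplified view of the router's return value.
interface RoutingResultSketch {
  agentCardId: string | null
  fallback?: 'queued'
}

// Hypothetical work item tracked by the caller.
interface PendingWork {
  workId: string
  attempts: number
}

// Re-queues the work when no agent was selected; otherwise reports a dispatch.
function handleRoutingResult(
  result: RoutingResultSketch,
  work: PendingWork,
  retryQueue: PendingWork[],
): 'dispatched' | 'requeued' {
  if (result.agentCardId === null) {
    retryQueue.push({ ...work, attempts: work.attempts + 1 })
    return 'requeued'
  }
  return 'dispatched'
}
```

Tracking an attempt count lets the caller cap retries or back off, rather than re-queuing indefinitely while every agent stays unreachable or over its hard cap.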
TypeScript API
```typescript
import { routeWorkToAgent } from '@/lib/a2a/load-aware-router'
import { recordOutcome } from '@/lib/a2a/mab-selector'

// Route work - all three stages run automatically
const result = await routeWorkToAgent({
  orgId: 'org_123',
  workType: 'development',
  description: 'Implement the new user auth flow',
  requiredSkills: ['typescript', 'next-auth'],
})

if (result.agentCardId === null) {
  // All candidates constrained out - handle fallback
  console.log('Fallback:', result.decisionMetadata.fallback)
} else {
  console.log('Selected:', result.agentCardId)
  // ... dispatch to the agent ...

  // Record outcome when the session completes
  await recordOutcome({
    agentCardId: result.agentCardId,
    orgId: 'org_123',
    workType: 'development',
    reward: 0.95, // continuous reward in [0, 1]
  })
}
```
Viewing routing decisions
The Routing Intelligence panel in your project's Performance section shows:
- Per-agent Beta posteriors visualised as probability density charts
- Exploration rate (how often the MAB selected a non-exploitation arm)
- Average confidence score across recent routing decisions
- Per-provider / per-work-type success rate table
- survivalReward field on each ProviderPosterior showing the code-survival blend contribution
See Routing Intelligence for full documentation.
Related pages
- A2A Registry - register agents to make them eligible for routing
- A2A Dispatches - dispatch log with outcomes
- Routing Intelligence - Thompson Sampling dashboard
- Provider Benchmarks - success/cost/duration by provider
- A2A MCP Bridge - wire protocol the router dispatches through