Agent Reliability

Track agent stability and recovery patterns. The Agent Reliability panel shows completion rates, crash frequency, and turn distribution across your workflow runs.

Overview

Agent reliability metrics measure how consistently your agents complete work and recover from errors. The panel displays:

Completed - Sessions that finished with a successful workResult (rework is counted separately)
Rework - Sessions that required manual intervention or re-attempt
Success Rate - Completed ÷ (Completed + Rework)
Crash Rate - Proportion of sessions that hit unrecoverable errors
Crash Count - Total number of crash events
Recovery Count - Sessions that hit a transient error and auto-retried before reaching a terminal state

Session Lifecycle

Understanding where a session can go helps you interpret reliability numbers correctly:

stateDiagram-v2
    [*] --> running: session created
    running --> completed: workResult=passed
    running --> reworked: workResult=rework
    running --> crashed: unrecoverable error
    running --> recovered: transient error → auto-retry
    recovered --> completed: retry succeeded
    recovered --> crashed: retry also failed
    completed --> [*]
    reworked --> [*]
    crashed --> [*]

completed → counted in "Completed" and "Success Rate"
reworked → counted in "Rework" (reduces success rate)
crashed → counted in "Crash Count" and "Crash Rate"
recovered → counted in "Recovery Count" (session eventually reached completed or crashed)
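
The mapping from lifecycle states to counters can be sketched as a small tally. This is a minimal illustration, not the dashboard's implementation: the `SessionOutcome` shape, the `recovered` flag, and the `tallyOutcomes` name are assumptions, and it counts one crash per crashed session.

```typescript
type TerminalState = "completed" | "reworked" | "crashed"

// Hypothetical record shape; field names are assumptions for illustration.
interface SessionOutcome {
  state: TerminalState
  recovered: boolean // true if the session passed through "recovered" on the way
}

interface Tally {
  completed: number
  rework: number
  crashCount: number
  recoveryCount: number
}

function tallyOutcomes(sessions: SessionOutcome[]): Tally {
  const tally: Tally = { completed: 0, rework: 0, crashCount: 0, recoveryCount: 0 }
  for (const s of sessions) {
    if (s.state === "completed") tally.completed += 1
    else if (s.state === "reworked") tally.rework += 1
    else tally.crashCount += 1 // assumption: one crash event per crashed session
    // Recovery is counted independently of the final outcome.
    if (s.recovered) tally.recoveryCount += 1
  }
  return tally
}
```

Note that a session that recovered and then crashed contributes to both Recovery Count and Crash Count under this reading.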

Key Metrics

Completion Metrics

Completed

Count of sessions with terminal state completed
Indicates sessions that finished with a successful work result; reworked sessions are counted separately under Rework

Rework

Count of sessions that returned workResult = 'rework' or were manually overridden by a human
Tracks sessions requiring a second pass or escalation to human review

Success Rate (%)

Calculated as: completed / (completed + rework) * 100
A high success rate (above 85%) indicates robust agent execution
Lower rates signal workflow issues, credential problems, or integration gaps
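
As a worked example of the formula, here is a minimal sketch. The `successRate` helper is hypothetical, and returning 0 when there are no finished sessions is an assumption rather than documented behavior.

```typescript
// Hypothetical helper; the name and signature are illustrative only.
function successRate(completed: number, rework: number): number {
  const total = completed + rework
  // Assumption: with no finished sessions, report 0 rather than NaN.
  if (total === 0) return 0
  return (completed / total) * 100
}

// Counts taken from the sample metric cards: 847 completed, 132 rework.
console.log(successRate(847, 132).toFixed(1)) // "86.5"
```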

Crash & Recovery

Crash Rate (%)

Proportion of sessions that encountered an unrecoverable error (e.g., runtime panic, infrastructure failure)
Crashes are distinct from rework - a crashed session cannot emit a structured work result
High crash rates warrant investigation of agent logs and execution environment health

Crash Count

Absolute number of crash events in the time window
Useful for detecting spikes and trending over time

Recovery Count

Sessions that experienced an error mid-execution but recovered and continued to a terminal state
Indicates resilient session lifecycle (e.g., transient network errors that auto-retried)
Recovery implies the session continued past the error, not that it succeeded - check completion metrics for outcomes

Turn Distribution

Turn distribution shows session length: how many agent "turns" (back-and-forth interactions) a session takes. A turn corresponds to a distinct thought/action/response triplet in the activity log.

Turns (1-5, 6-10, 11-20, 20+)
├─ 1-5 turns:   45%  [████████░░] - Quick, direct sessions
├─ 6-10 turns:  35%  [██████░░░░] - Moderate iteration
├─ 11-20 turns: 15%  [███░░░░░░░] - Complex reasoning
└─ 20+ turns:   5%   [█░░░░░░░░░] - Edge cases / hangs

Interpretation:

Skew toward 1-5 turns: The agent is highly deterministic, uses minimal tooling, or works from well-structured prompts
Bimodal distribution (1-5 and 11-20): Two distinct task pathways - simple and complex - handled by the same agent. Consider splitting into two agents.
Long tail (20+ turns): Potential infinite loops or over-iteration; investigate workflow design or gate exit conditions
All sessions in one bucket: Possible templated behavior or a single canonical task type
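
The bucketing behind the chart can be sketched as follows. The function names are hypothetical, and because the labels 11-20 and 20+ overlap at exactly 20 turns, this sketch assumes 20 falls in 11-20 and "20+" means more than 20.

```typescript
type TurnBucket = "1-5" | "6-10" | "11-20" | "20+"

// Assumption: a session with exactly 20 turns belongs to "11-20".
function turnBucket(turns: number): TurnBucket {
  if (turns <= 5) return "1-5"
  if (turns <= 10) return "6-10"
  if (turns <= 20) return "11-20"
  return "20+"
}

// Counts sessions per bucket, always returning all four buckets in display order.
function turnDistribution(turnCounts: number[]): Array<{ label: TurnBucket; count: number }> {
  const labels: TurnBucket[] = ["1-5", "6-10", "11-20", "20+"]
  const counts: Record<TurnBucket, number> = { "1-5": 0, "6-10": 0, "11-20": 0, "20+": 0 }
  for (const t of turnCounts) counts[turnBucket(t)] += 1
  return labels.map((label) => ({ label, count: counts[label] }))
}
```

The output has the same `{ label, count }` shape as the `turnDistribution` field in the component props.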


Display & Filtering

The Performance dashboard hosts the Agent Reliability panel with controls for:

Time window - Last 7 days, 30 days, or 90 days (configurable via dashboard settings)
Project filter - Drill into a specific project's reliability
Workflow filter - Optional; focus on one workflow definition
Work-type filter - Optional; compare reliability across research/dev/qa/acceptance stages

Data Sources

All metrics derive from the agent_sessions and session_activities tables:

Session completion state - agent_sessions.state (completed, reworked, crashed)
Work result - agent_sessions.workResult enum (success, rework, crashed)
Activity log - session_activities rows; crash events marked with type='error' and error severity
Turn count - Activity count per session (distinct thought/action/response triplets)

Visual Components

Outcome Distribution Bar

A horizontal stacked bar shows completion breakdown:

[████ Pass ██ Rework]
 80%       20%

Completed sessions are shown in green and rework in red, so overall health can be read at a glance.

Metric Cards

Four cards at the top of the panel:

┌─ Completed      ┌─ Rework         ┌─ Success Rate   ┌─ Crash Rate
│ 847 sessions    │ 132 sessions     │ 86.5%           │ 2.3%
└────────────────└────────────────└────────────────└──────────────

Turn Distribution Chart

Stacked bar or donut chart breaking down session length cohorts. Hover reveals counts and percentages.
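
The percentages shown on hover can be derived from the per-bucket counts. A minimal sketch, assuming the chart receives `{ label, count }` entries; the `withPercentages` helper is hypothetical and not taken from the component source.

```typescript
type TurnSlice = { label: string; count: number }

// Adds each bucket's share of the total, in percent (0-100).
function withPercentages(dist: TurnSlice[]): Array<TurnSlice & { percent: number }> {
  const total = dist.reduce((sum, d) => sum + d.count, 0)
  return dist.map((d) => ({
    ...d,
    // Assumption: an empty distribution reports 0% everywhere instead of NaN.
    percent: total === 0 ? 0 : (d.count * 100) / total,
  }))
}
```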

API & Querying

Component props (TypeScript):

export interface AgentStats {
  completed: number
  failed: number        // "rework" count
  successRate: number   // 0-100
  crashRate: number     // 0-100
  crashCount: number
  recoveryCount: number
  toolUsage: Array<{
    name: string
    count: number
    errorRate: number
  }>
  turnDistribution: Array<{
    label: string      // "1-5", "6-10", etc.
    count: number
  }>
}

The panel is typically populated by a dashboard fetch call that aggregates these metrics server-side (implementation in performance-client.tsx).

Interpreting Results

Healthy Agent Fleet

✓ Success rate >85%
✓ Crash rate <5%
✓ Turn distribution skews toward 1-5 and 6-10 (deterministic execution)
✓ Recovery count >0 (resilience is working)

Warning Signs

⚠ Success rate 70-85% - rework happening; check workflow design and condition logic
⚠ Success rate <70% - systemic issue; likely credential/integration problem or gate stuck
⚠ Crash rate >10% - infrastructure or agent code instability; check execution logs
⚠ Skew to 20+ turns - potential infinite loop or over-iteration; verify loop exit conditions
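
The rate thresholds above could be encoded as a simple check, for example in an alerting script. This is a sketch under stated assumptions: `warningSigns` and `RateInput` are hypothetical names, the boundary value 85% is treated as a warning (since healthy is stated as >85%), and crash rates between 5% and 10% produce no warning because this page does not classify them.

```typescript
// Both rates are percentages in the range 0-100, as in the component props.
interface RateInput {
  successRate: number
  crashRate: number
}

function warningSigns(stats: RateInput): string[] {
  const warnings: string[] = []
  if (stats.successRate < 70) {
    warnings.push("success rate below 70%: systemic issue likely")
  } else if (stats.successRate <= 85) {
    warnings.push("success rate 70-85%: rework happening")
  }
  if (stats.crashRate > 10) {
    warnings.push("crash rate above 10%: check execution logs")
  }
  return warnings
}

// Values from the sample metric cards raise no warnings.
console.log(warningSigns({ successRate: 86.5, crashRate: 2.3 })) // []
```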

Action Items

High rework rate - Review human overrides in the performance dashboard; identify repeated escalations
High crash rate - Check execution environment logs (Sentry, runtime logs) for stack traces
Long turns - Inspect session activity stream to find where iteration exceeds expectations

Related

Tool Analytics - Tool invocation frequency and error rates
Rework & Escalation - Dive deeper into rework patterns
Session Detail - Inspect individual session traces and activities
Session Inspector - 7-tab deep-debug view for crash forensics
