Eval Runs Dashboard

The Eval Runs Dashboard is the operator-facing surface for reviewing agent evaluation history, tracking score trends over time, and catching regressions before they affect production. It lives at /admin/evals and is available to platform operators.

Overview

The dashboard lists eval_runs rows with enriched metadata - agent card name, dataset name, grader pass counts, aggregate score, and a rolling regression delta that compares each agent card's last 10 runs with the 10 runs before them.

Filtering the run list

GET /api/admin/evals

Query parameter	Type	Default	Notes
`orgId`	string?	-	Filter to a single org (operators see all orgs when omitted)
`agentCardId`	string?	-	Filter to a specific AgentCard
`datasetId`	string?	-	Filter to runs from a specific dataset
`limit`	number?	50	Max 200 per page
`before`	ISO-8601?	-	Cursor: `startedAt < before` for pagination

curl -H "Authorization: Bearer $OPERATOR_TOKEN" \
  "https://rensei.ai/api/admin/evals?agentCardId=ac_abc&limit=20"
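For scripted access, the query string can be assembled from the parameters in the table above. The following is a minimal sketch, not the dashboard's own client: the EvalListFilters type and the buildEvalsPath helper are hypothetical names, and clamping limit on the client is an assumption, since this page does not say how the server handles values above 200.

```typescript
// Filter parameters accepted by GET /api/admin/evals, per the parameter table.
interface EvalListFilters {
  orgId?: string;
  agentCardId?: string;
  datasetId?: string;
  limit?: number; // documented default 50, max 200 per page
  before?: string; // ISO-8601 cursor: startedAt < before
}

const MAX_LIMIT = 200;

// Hypothetical helper: returns the path plus query string for the list endpoint.
function buildEvalsPath(filters: EvalListFilters = {}): string {
  const pairs: Array<[string, string]> = [];
  if (filters.orgId !== undefined) pairs.push(["orgId", filters.orgId]);
  if (filters.agentCardId !== undefined) pairs.push(["agentCardId", filters.agentCardId]);
  if (filters.datasetId !== undefined) pairs.push(["datasetId", filters.datasetId]);
  if (filters.limit !== undefined && isFinite(filters.limit)) {
    // Assumption: clamp to 1..200 client-side rather than rely on server behavior.
    const clamped = Math.min(Math.max(1, Math.floor(filters.limit)), MAX_LIMIT);
    pairs.push(["limit", String(clamped)]);
  }
  if (filters.before !== undefined) pairs.push(["before", filters.before]);

  const query = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return query ? `/api/admin/evals?${query}` : "/api/admin/evals";
}
```

Calling buildEvalsPath({ agentCardId: "ac_abc", limit: 20 }) yields the same path and query as the curl example. Colons in a before timestamp are percent-encoded by encodeURIComponent, which is equivalent to the unencoded form once the server decodes the query string.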

Run list item

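Each item in the runs[] array includes id, agentCardId, agentName, workType, startedAt, score, passCount, totalGraders, gradeDelta, regressionAlert, and gradeResults, as shown in the example response under Score trend chart.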
Regression detection

The API computes a rolling gradeDelta for each agent card on the page. It fetches the last REGRESSION_WINDOW * 2 = 20 runs for the card, splits them into a "recent" half and a "prior" half, and computes the mean score of each half. The delta is recentMean - priorMean.

REGRESSION_WINDOW = 10
DEFAULT_REGRESSION_THRESHOLD = 0.15

gradeDelta = mean(recent 10 runs) - mean(prior 10 runs)
regressionAlert = |gradeDelta| > 0.15

A negative gradeDelta with regressionAlert: true indicates that the agent card's 10 most recent runs are scoring more than 0.15 lower on average than the preceding 10-run baseline. A positive gradeDelta with regressionAlert: true is an unusually large improvement - worth investigating too.

The threshold defaults to 0.15. Per-AgentCard overrides via evalConfig.regressionThreshold are planned but not yet applied at query time; the default is used for all cards today.

Regression detection requires at least 20 historical runs for a given AgentCard before the gradeDelta is meaningful. Cards with fewer than 20 runs return gradeDelta: null and regressionAlert: false.
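The rule described in this section can be reproduced offline, for example to sanity-check an alert. The sketch below follows the formula and the fewer-than-20-runs behavior stated above; the function name computeRegression, the newest-first ordering of the input, and the optional threshold parameter are assumptions for illustration (the API itself uses the default threshold for all cards today).

```typescript
const REGRESSION_WINDOW = 10;
const DEFAULT_REGRESSION_THRESHOLD = 0.15;

interface RegressionResult {
  gradeDelta: number | null;
  regressionAlert: boolean;
}

// Arithmetic mean of a non-empty list of scores.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// scoresNewestFirst: aggregate scores for ONE agent card, most recent run first
// (the ordering is an assumption of this sketch).
function computeRegression(
  scoresNewestFirst: number[],
  threshold: number = DEFAULT_REGRESSION_THRESHOLD,
): RegressionResult {
  // Fewer than 20 runs: no delta and no alert, as documented.
  if (scoresNewestFirst.length < REGRESSION_WINDOW * 2) {
    return { gradeDelta: null, regressionAlert: false };
  }
  const recent = scoresNewestFirst.slice(0, REGRESSION_WINDOW);
  const prior = scoresNewestFirst.slice(REGRESSION_WINDOW, REGRESSION_WINDOW * 2);
  const gradeDelta = mean(recent) - mean(prior);
  // The alert is symmetric: large drops and large improvements both trigger it.
  return { gradeDelta, regressionAlert: Math.abs(gradeDelta) > threshold };
}
```

For instance, ten recent runs averaging 0.5 against ten prior runs averaging 0.8 give a gradeDelta of about -0.3, which exceeds the 0.15 threshold in absolute value and raises the alert.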

Score trend chart

The /admin/evals page renders a per-AgentCard score trend chart for any card with more than one run in the current time window (default 7 days, configurable via the driftWindowDays field on evalConfig). The chart plots score on the Y-axis and startedAt on the X-axis, with a horizontal reference line at the card's prior-window mean. Runs with regressionAlert: true are highlighted in red.

The chart is rendered client-side from the runs[] response - no separate charting endpoint is needed. For scripting access, the same data is available via the list endpoint filtered by agentCardId.

// Example response with a regression alert
{
  "runs": [
    {
      "id": "evr_3f7a9c12e45b",
      "agentCardId": "ac_abc",
      "agentName": "development-agent",
      "workType": "development",
      "startedAt": "2026-06-02T14:00:00.000Z",
      "score": 0.51,
      "passCount": 1,
      "totalGraders": 2,
      "gradeDelta": -0.22,
      "regressionAlert": true,
      "gradeResults": [
        { "graderId": "structural/zod-v1", "score": 1.0, "pass": true, "reasoning": "Output matches schema." },
        { "graderId": "model-grader/llm-judge-v1", "score": 0.02, "pass": false, "reasoning": "Output lacks..." }
      ]
    }
  ],
  "hasMore": false
}
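A script consuming this response will usually want typed access to the fields. The types below are inferred from the example response only, so they are a sketch rather than the authoritative schema: real items may carry additional fields, and the helper name failingGradersInRegressedRuns is hypothetical.

```typescript
// Shapes inferred from the example response; not an authoritative schema.
interface GradeResult {
  graderId: string;
  score: number;
  pass: boolean;
  reasoning: string;
}

interface EvalRunListItem {
  id: string;
  agentCardId: string;
  agentName: string;
  workType: string;
  startedAt: string; // ISO-8601 timestamp
  score: number;
  passCount: number;
  totalGraders: number;
  gradeDelta: number | null; // null when the card has fewer than 20 runs
  regressionAlert: boolean;
  gradeResults: GradeResult[];
}

interface EvalRunsResponse {
  runs: EvalRunListItem[];
  hasMore: boolean;
}

interface FailingGrader {
  runId: string;
  graderId: string;
  score: number;
}

// Hypothetical helper: for runs flagged with regressionAlert, list the graders
// that failed, which is usually the first thing to inspect.
function failingGradersInRegressedRuns(response: EvalRunsResponse): FailingGrader[] {
  const failing: FailingGrader[] = [];
  for (const run of response.runs) {
    if (!run.regressionAlert) continue;
    for (const grade of run.gradeResults) {
      if (!grade.pass) {
        failing.push({ runId: run.id, graderId: grade.graderId, score: grade.score });
      }
    }
  }
  return failing;
}
```

Applied to the example response above, the helper returns a single entry for model-grader/llm-judge-v1 on run evr_3f7a9c12e45b.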

Run detail

GET /api/admin/evals/:runId

Returns the full eval_runs row for a single run, including the complete gradeResults array and linked trace metadata. The detail page at /admin/evals/:runId renders this as a per-grader breakdown with the score, pass/fail status, and reasoning for each grader.

The detail page also surfaces pending human grader entries with a review form when metadata.pending: true is present.

Pagination

The list uses cursor-based pagination on startedAt:

# First page
GET /api/admin/evals?limit=50

# Next page - pass the startedAt of the last item as cursor
GET /api/admin/evals?limit=50&before=2026-06-01T10:00:00.000Z

hasMore: true in the response indicates additional pages exist.
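The cursor loop can be sketched as follows. This is an illustration under assumptions, not a supplied client: nextBefore and collectRuns are hypothetical names, the page fetcher is injected so the caller decides how the HTTP request and authorization are made, and the maxPages cap is a safety guard added for the sketch.

```typescript
interface RunCursorItem {
  startedAt: string; // ISO-8601 timestamp used as the pagination cursor
}

interface RunsPage<T extends RunCursorItem> {
  runs: T[];
  hasMore: boolean;
}

// Returns the cursor for the next request, or null when there are no more pages.
function nextBefore(page: RunsPage<RunCursorItem>): string | null {
  if (!page.hasMore || page.runs.length === 0) return null;
  // Pass the startedAt of the last item on the page as the `before` cursor.
  return page.runs[page.runs.length - 1].startedAt;
}

// Walks pages until hasMore is false or maxPages is reached.
// fetchPage(before) is supplied by the caller; before is null for the first page.
async function collectRuns<T extends RunCursorItem>(
  fetchPage: (before: string | null) => Promise<RunsPage<T>>,
  maxPages: number = 20,
): Promise<T[]> {
  const all: T[] = [];
  let before: string | null = null;
  for (let i = 0; i < maxPages; i++) {
    const page: RunsPage<T> = await fetchPage(before);
    all.push(...page.runs);
    before = nextBefore(page);
    if (before === null) break;
  }
  return all;
}
```

Because the cursor is a strict startedAt < before comparison, runs that share the exact startedAt of the last item on a page could be skipped; whether the server guards against that is not stated here.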

Eval Emission - how eval runs are created at session terminal
Structural Grader - deterministic Zod grader
Model Grader - LLM-as-judge grader
Human Grader - pending review items that appear in this dashboard
Eval Replay - re-run graders against a frozen trace from this dashboard
BFSI Eval Mode - compliance requirements that affect what appears in the queue
