Eval Runs Dashboard
Run list and regression detection.
The Eval Runs Dashboard is the operator-facing surface for reviewing agent evaluation history, tracking score trends over time, and catching regressions before they affect production. It lives at /admin/evals and is available to platform operators.
Overview
The dashboard lists eval_runs rows with enriched metadata: agent card name, dataset name, grader pass counts, aggregate score, and a rolling regression delta that compares each agent card's 10 most recent runs against the 10 runs before them.
Filtering the run list
GET /api/admin/evals

| Query parameter | Type | Default | Notes |
|---|---|---|---|
| orgId | string? | - | Filter to a single org (operators see all orgs when omitted) |
| agentCardId | string? | - | Filter to a specific AgentCard |
| datasetId | string? | - | Filter to runs from a specific dataset |
| limit | number? | 50 | Max 200 per page |
| before | ISO-8601? | - | Cursor: startedAt < before for pagination |
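The query parameters above can be assembled programmatically. The following is a minimal sketch, not the platform's client: the helper name build_evals_url is hypothetical, the host is taken from the curl example on this page, and clamping limit on the client side is an assumption (how the server treats values above 200 is not stated here).

```python
from urllib.parse import urlencode

BASE_URL = "https://rensei.ai/api/admin/evals"  # host taken from the curl example
MAX_LIMIT = 200  # documented per-page maximum


def build_evals_url(org_id=None, agent_card_id=None, dataset_id=None,
                    limit=50, before=None):
    """Build a run-list URL, omitting any filter that is not set (hypothetical helper)."""
    if limit < 1:
        raise ValueError("limit must be positive")
    params = {
        "orgId": org_id,
        "agentCardId": agent_card_id,
        "datasetId": dataset_id,
        # Assumption: clamp client-side rather than rely on server behaviour.
        "limit": min(limit, MAX_LIMIT),
        "before": before,  # ISO-8601 cursor; urlencode percent-encodes the colons
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}?{query}"
```

Calling build_evals_url(agent_card_id="ac_abc", limit=20) produces the same URL used in the curl example.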
curl -H "Authorization: Bearer $OPERATOR_TOKEN" \
  "https://rensei.ai/api/admin/evals?agentCardId=ac_abc&limit=20"

Run list item
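Each item in the runs[] array carries per-run fields such as id, agentCardId, agentName, workType, startedAt, score, passCount, totalGraders, gradeDelta, regressionAlert, and gradeResults; the example response under Score trend chart shows them in context.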
Regression detection
The API computes a rolling gradeDelta for each agent card on the page. It fetches the last REGRESSION_WINDOW * 2 = 20 runs for the card, splits them into a "recent" half and a "prior" half, and computes the mean score of each half. The delta is recentMean - priorMean.
REGRESSION_WINDOW = 10
DEFAULT_REGRESSION_THRESHOLD = 0.15
gradeDelta = mean(recent 10 runs) - mean(prior 10 runs)
regressionAlert = |gradeDelta| > 0.15

A negative gradeDelta with regressionAlert: true indicates that the agent's most recent sessions are scoring meaningfully worse than the preceding baseline. A positive gradeDelta with regressionAlert: true is an unusual improvement - worth investigating too.
The threshold defaults to 0.15. Per-AgentCard overrides via evalConfig.regressionThreshold are planned but not yet applied at query time; the default is used for all cards today.
Regression detection requires at least 20 historical runs for a given AgentCard before the gradeDelta is meaningful. Cards with fewer than 20 runs return gradeDelta: null and regressionAlert: false.
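The calculation described in this section can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the API's actual code: the function name grade_delta is hypothetical, and scores are assumed to arrive ordered newest first.

```python
from statistics import mean

REGRESSION_WINDOW = 10
DEFAULT_REGRESSION_THRESHOLD = 0.15


def grade_delta(scores_newest_first, threshold=DEFAULT_REGRESSION_THRESHOLD):
    """Return (gradeDelta, regressionAlert) for one agent card (hypothetical helper)."""
    needed = REGRESSION_WINDOW * 2
    # Fewer than 20 runs: no meaningful delta, and never an alert.
    if len(scores_newest_first) < needed:
        return None, False
    recent = scores_newest_first[:REGRESSION_WINDOW]       # 10 most recent runs
    prior = scores_newest_first[REGRESSION_WINDOW:needed]  # the 10 runs before them
    delta = mean(recent) - mean(prior)
    # The alert fires on large moves in either direction.
    return delta, abs(delta) > threshold
```

For example, ten recent runs scoring 0.5 against ten prior runs scoring 0.75 give a delta of -0.25, which exceeds the 0.15 threshold and raises the alert.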
Score trend chart
The /admin/evals page renders a per-AgentCard score trend chart for any card with more than one run in the current time window (default 7 days, configurable via the driftWindowDays field on evalConfig). The chart plots score on the Y-axis and startedAt on the X-axis, with a horizontal reference line at the card's prior-window mean. Runs with regressionAlert: true are highlighted in red.
The chart is rendered client-side from the runs[] response - no separate charting endpoint is needed. For scripting access, the same data is available via the list endpoint filtered by agentCardId.
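For scripted use, a list response can be reduced to chart-ready points. The sketch below assumes the response shape shown in the example in this section; the function name trend_points is hypothetical, and it does not compute the prior-window reference line.

```python
from datetime import datetime


def trend_points(response):
    """Return (startedAt, score, regressionAlert) tuples, oldest first (hypothetical helper)."""
    points = []
    for run in response.get("runs", []):
        # Older Python versions reject a trailing "Z" in fromisoformat, so normalise it.
        started = datetime.fromisoformat(run["startedAt"].replace("Z", "+00:00"))
        points.append((started, run["score"], bool(run.get("regressionAlert"))))
    # Sort by timestamp so the series reads left to right on the X-axis.
    return sorted(points, key=lambda p: p[0])
```

Points whose third element is true correspond to the runs the dashboard highlights in red.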
// Example response with a regression alert
{
  "runs": [
    {
      "id": "evr_3f7a9c12e45b",
      "agentCardId": "ac_abc",
      "agentName": "development-agent",
      "workType": "development",
      "startedAt": "2026-06-02T14:00:00.000Z",
      "score": 0.51,
      "passCount": 1,
      "totalGraders": 2,
      "gradeDelta": -0.22,
      "regressionAlert": true,
      "gradeResults": [
        { "graderId": "structural/zod-v1", "score": 1.0, "pass": true, "reasoning": "Output matches schema." },
        { "graderId": "model-grader/llm-judge-v1", "score": 0.02, "pass": false, "reasoning": "Output lacks..." }
      ]
    }
  ],
  "hasMore": false
}

Run detail
GET /api/admin/evals/:runId

Returns the full eval_runs row for a single run, including the complete gradeResults array and linked trace metadata. The detail page at /admin/evals/:runId renders this as a per-grader breakdown with the score, pass/fail status, and reasoning for each grader.
The detail page also surfaces pending human grader entries with a review form when metadata.pending: true is present.
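A script can look for the same pending entries in a run detail response. This sketch rests on an assumption the page does not confirm: that metadata.pending sits on individual gradeResults entries. The function name pending_human_grades is hypothetical.

```python
def pending_human_grades(run):
    """Return gradeResults entries awaiting human review (hypothetical helper).

    Assumption: each grade result may carry a "metadata" object whose
    "pending" flag is true while a human grader has not yet responded.
    """
    return [
        grade
        for grade in run.get("gradeResults", [])
        if (grade.get("metadata") or {}).get("pending") is True
    ]
```

If the flag turns out to live elsewhere on the row, only the lookup path inside the list comprehension needs to change.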
Pagination
The list uses cursor-based pagination on startedAt:
# First page
GET /api/admin/evals?limit=50
# Next page - pass the startedAt of the last item as cursor
GET /api/admin/evals?limit=50&before=2026-06-01T10:00:00.000Z

hasMore: true in the response indicates additional pages exist.
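The cursor loop can be sketched as a generator. The function name iter_runs and the fetch_page callable are hypothetical: fetch_page is assumed to take a dict of query parameters and return the parsed JSON response. The sketch also assumes each page is ordered newest first, so the last item holds the oldest startedAt.

```python
def iter_runs(fetch_page, limit=50):
    """Yield every run, following the startedAt cursor until hasMore is false.

    fetch_page is a caller-supplied function (hypothetical) that performs the
    GET /api/admin/evals request and returns the decoded response body.
    """
    before = None
    while True:
        params = {"limit": limit}
        if before is not None:
            params["before"] = before
        page = fetch_page(params)
        runs = page.get("runs", [])
        yield from runs
        # Stop when the server reports no more pages, or on an empty page
        # (a guard against looping forever on an unexpected response).
        if not page.get("hasMore") or not runs:
            break
        before = runs[-1]["startedAt"]  # cursor: startedAt of the last item
```

Injecting fetch_page keeps the loop independent of any particular HTTP client and makes it straightforward to exercise with canned responses.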
Related pages
- Eval Emission - how eval runs are created at session terminal
- Structural Grader - deterministic Zod grader
- Model Grader - LLM-as-judge grader
- Human Grader - pending review items that appear in this dashboard
- Eval Replay - re-run graders against a frozen trace from this dashboard
- BFSI Eval Mode - compliance requirements that affect what appears in the queue