Eval Replay

Eval replay lets you re-grade a historical session against a different grader configuration without re-running the agent. The original eval run is never mutated - replay always creates a new eval_runs row that references the same frozen eval_traces row.

Use replay when:

You've updated a rubric or schema and want to re-score existing traces against the new grader.
You want to compare a model grader's result with the result from a different judge model.
You're debugging a regression and need to understand whether the problem is in the agent's output or in an outdated grader.

API

POST /api/evals/runs/:id/replay

Requires operator authentication.

Request body (all fields optional)

interface ReplayRequest {
  /** Override grader list. Defaults to the original run's graderConfig.graders. */
  graderIds?: string[]
  /** Bind the replay to a specific dataset case (for scoring metrics). */
  datasetCaseId?: string
  /** Re-judge with a different model profile (model-grader only). */
  targetModelProfileId?: string
}
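
A typed client could assemble a request matching this interface roughly as follows. This is a minimal sketch: the helper name `buildReplayRequest` and its parameters are hypothetical, and the interface is repeated only so the snippet is self-contained. The host and header names are taken from the curl examples on this page.

```typescript
// Repeated from the interface above so the snippet stands alone.
interface ReplayRequest {
  graderIds?: string[]
  datasetCaseId?: string
  targetModelProfileId?: string
}

// Hypothetical helper: builds the URL and request options for a replay call.
// It performs no network I/O; the caller passes the result to an HTTP client.
function buildReplayRequest(
  baseUrl: string,
  runId: string,
  operatorToken: string,
  body: ReplayRequest = {},
) {
  return {
    url: `${baseUrl}/api/evals/runs/${encodeURIComponent(runId)}/replay`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${operatorToken}`,
        "Content-Type": "application/json",
      },
      // All fields are optional, so an empty object is a valid body.
      body: JSON.stringify(body),
    },
  }
}

const req = buildReplayRequest("https://rensei.ai", "evr_1a2b3c4d5e6f", "OPERATOR_TOKEN", {
  targetModelProfileId: "my-opus-profile-id",
})
```

In a runtime with a global fetch, the result can be sent with `fetch(req.url, req.init)`.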

Response

{
  "newRunId": "evr_9a3b2c7d1e4f",
  "gradersRun": ["structural/zod-v1", "model-grader/llm-judge-v1"],
  "gradeResults": [
    {
      "graderId": "structural/zod-v1",
      "score": 1.0,
      "pass": true,
      "reasoning": "Output matches schema."
    },
    {
      "graderId": "model-grader/llm-judge-v1",
      "score": 0.88,
      "pass": true,
      "reasoning": "The output correctly identifies...",
      "metadata": { "judgeModel": "anthropic/claude-3-5-haiku-20241022" }
    }
  ],
  "replayOf": "evr_1a2b3c4d5e6f"
}

Status 201 Created. The newRunId can be viewed in the Eval Runs Dashboard.
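
For typed clients, the response could be modelled roughly as follows. The names `ReplayResponse`, `GradeResult`, `isGradeResult`, and `isReplayResponse` are hypothetical, and the field shapes are inferred from the sample response only, so treat the optionality of `metadata` as an assumption.

```typescript
// Shapes inferred from the sample response; not the official type definitions.
interface GradeResult {
  graderId: string
  score: number
  pass: boolean
  reasoning: string
  metadata?: Record<string, unknown> // appears only on the model grader result in the sample
}

interface ReplayResponse {
  newRunId: string
  gradersRun: string[]
  gradeResults: GradeResult[]
  replayOf: string
}

// Checks the required fields of a single grade result.
function isGradeResult(value: unknown): value is GradeResult {
  if (typeof value !== "object" || value === null) return false
  const r = value as Record<string, unknown>
  return (
    typeof r.graderId === "string" &&
    typeof r.score === "number" &&
    typeof r.pass === "boolean" &&
    typeof r.reasoning === "string"
  )
}

// Minimal runtime check to apply before trusting a parsed JSON body.
function isReplayResponse(value: unknown): value is ReplayResponse {
  if (typeof value !== "object" || value === null) return false
  const { newRunId, replayOf, gradersRun, gradeResults } = value as Record<string, unknown>
  return (
    typeof newRunId === "string" &&
    typeof replayOf === "string" &&
    Array.isArray(gradersRun) &&
    gradersRun.every((id: unknown) => typeof id === "string") &&
    Array.isArray(gradeResults) &&
    gradeResults.every(isGradeResult)
  )
}
```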

Example - replay with a different grader set

curl -X POST \
  -H "Authorization: Bearer $OPERATOR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"graderIds": ["structural/zod-v1"]}' \
  "https://rensei.ai/api/evals/runs/evr_1a2b3c4d5e6f/replay"

Example - replay with a different judge model

curl -X POST \
  -H "Authorization: Bearer $OPERATOR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"targetModelProfileId": "my-opus-profile-id"}' \
  "https://rensei.ai/api/evals/runs/evr_1a2b3c4d5e6f/replay"

How replay works

The route loads the original eval_runs row by id. Returns 404 if not found.

It loads the associated eval_traces row via traceRef. Returns 422 if the trace is missing (traces are retained by default; GC may have evicted old ones).

It determines the grader list: explicit graderIds in the request body take priority, then it falls back to the original run's graderConfig.graders, then to an empty list.

It runs all graders concurrently via Promise.all and awaits the results before responding. Structural graders complete in microseconds; model graders make a live LLM call.

A new eval_runs row is inserted with graderConfig.replayOf = originalRunId. The original run is untouched.
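
The steps above can be sketched end to end with in-memory maps standing in for the eval_runs and eval_traces tables. This is an illustration under stated assumptions, not the route's actual code: the row shapes, the `replay` function, its `newId` parameter, and the handling of unresolvable grader IDs are all hypothetical.

```typescript
// Hypothetical row and grader shapes for the sketch.
interface EvalRun {
  id: string
  traceRef?: string
  graderConfig: { graders?: string[]; replayOf?: string }
}
interface EvalTrace { id: string; inputPayload?: unknown; outputPayload?: unknown }
interface GradeResult { graderId: string; score: number; pass: boolean; reasoning: string }
type Grader = (ctx: { input: unknown; output: unknown; traceRef: string }) => Promise<GradeResult>

type ReplayOutcome =
  | { status: 404 | 422; error: string }
  | { status: 201; newRun: EvalRun; gradeResults: GradeResult[] }

async function replay(
  runs: Map<string, EvalRun>,
  traces: Map<string, EvalTrace>,
  graders: Map<string, Grader>,
  runId: string,
  body: { graderIds?: string[] },
  newId: string, // ID generation is out of scope here, so the caller supplies one
): Promise<ReplayOutcome> {
  // Step 1: load the original run.
  const original = runs.get(runId)
  if (!original) return { status: 404, error: "original run not found" }

  // Step 2: load the frozen trace via traceRef.
  const trace = original.traceRef ? traces.get(original.traceRef) : undefined
  if (!trace) return { status: 422, error: "trace missing" }

  // Step 3: request body first, then the original config, then an empty list.
  const graderIds = body.graderIds ?? original.graderConfig.graders ?? []

  // Step 4: run every grader concurrently against the same frozen context.
  const ctx = {
    input: trace.inputPayload ?? null,
    output: trace.outputPayload ?? null,
    traceRef: trace.id,
  }
  const gradeResults = await Promise.all(
    graderIds.map(async (id): Promise<GradeResult> => {
      const grader = graders.get(id)
      // Assumption: an unresolvable ID becomes a failing result rather than an error.
      if (!grader) return { graderId: id, score: 0, pass: false, reasoning: `unknown grader: ${id}` }
      return grader(ctx)
    }),
  )

  // Step 5: insert a new row that points back at the original; the original is not modified.
  const newRun: EvalRun = {
    id: newId,
    traceRef: trace.id,
    graderConfig: { graders: graderIds, replayOf: original.id },
  }
  runs.set(newRun.id, newRun)
  return { status: 201, newRun, gradeResults }
}
```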

Grader resolution

The replay endpoint calls registerBuiltinGraders() before resolving grader IDs. Built-in grader IDs:

ID	Type
`structural/zod-v1`	StructuralZodGrader
`model-grader/llm-judge-v1`	ModelGrader
`human-grader/v1`	HumanGrader

Custom graders registered by your organization are resolved from the grader registry as well.

GradeContext from frozen trace

The GradeContext passed to each grader is reconstituted from the frozen eval_traces row:

const gradeContext: GradeContext = {
  input: trace.inputPayload ?? null,
  output: trace.outputPayload ?? null,
  traceRef: trace.id,
}

Input and output payloads are currently null in most traces - full payload capture is not yet wired. The inputHash / outputHash on the run row are still populated. Structural graders that rely on ctx.output will score against null until payload capture ships.

New run row

The new eval run row records the replay provenance:

{
  "graderConfig": {
    "graders": ["structural/zod-v1"],
    "replayOf": "evr_1a2b3c4d5e6f",
    "targetModelProfileId": "my-opus-profile-id"
  }
}

This lets you trace the replay chain in the dashboard - filter by agentCardId and look for graderConfig.replayOf to see the replay history.
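
Outside the dashboard, the same chain can be reconstructed from exported rows by following graderConfig.replayOf. The sketch below assumes a hypothetical `RunRow` shape holding only the fields it needs; `replayChain` and `replaysOf` are illustrative helpers, not part of the product API.

```typescript
// Hypothetical minimal view of an eval_runs row.
interface RunRow {
  id: string
  graderConfig: { replayOf?: string }
}

// Walks replayOf links from a run back towards the run it derives from.
// Stops when a run has no replayOf, when a referenced row is absent, or on a cycle.
function replayChain(rows: RunRow[], startId: string): string[] {
  const byId = new Map<string, RunRow>()
  for (const row of rows) byId.set(row.id, row)

  const chain: string[] = []
  const seen = new Set<string>()
  let current: string | undefined = startId
  while (current !== undefined && !seen.has(current)) {
    seen.add(current)
    chain.push(current)
    const row = byId.get(current)
    current = row ? row.graderConfig.replayOf : undefined
  }
  return chain
}

// Lists the IDs of runs that were created as direct replays of the given run.
function replaysOf(rows: RunRow[], runId: string): string[] {
  return rows.filter((r) => r.graderConfig.replayOf === runId).map((r) => r.id)
}
```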

Error cases

Status	Condition
`404`	Original run not found
`422`	No `traceRef` on original run, or trace row has been GC'd
`500`	Internal error (grader registration, DB write)

Individual grader failures do not abort the replay: a failed grader returns a result with score: 0 and pass: false, with the error message in reasoning.
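
One way to get this behaviour with Promise.all is to wrap each grader so that a thrown error is converted into a failing result instead of rejecting the whole batch. The sketch below is an assumption about how such isolation could be written; `gradeSafely` and `runAll` are hypothetical names.

```typescript
interface GradeResult { graderId: string; score: number; pass: boolean; reasoning: string }
type Grader = () => Promise<GradeResult>

// Converts a rejected grader call into a score: 0, pass: false result.
async function gradeSafely(graderId: string, grader: Grader): Promise<GradeResult> {
  try {
    return await grader()
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    return { graderId, score: 0, pass: false, reasoning: message }
  }
}

// Because every wrapped promise resolves, Promise.all never rejects here,
// and results come back in the same order as the input list.
async function runAll(graders: Array<[string, Grader]>): Promise<GradeResult[]> {
  return Promise.all(graders.map(([id, grader]) => gradeSafely(id, grader)))
}
```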

Eval Emission - creates the original eval_runs + eval_traces rows
Structural Grader - deterministic grader, works well in replay
Model Grader - judge model can be swapped in replay via targetModelProfileId
Eval Runs Dashboard - view replay results alongside original runs
BFSI Eval Mode - 7-year trace retention ensures replay is available for compliance audits
