Eval Emission

Eval emission is the entry point of the evaluation pipeline. When an agent session reaches a terminal status (completed, failed, or stopped), the emission hook fires, checks whether the session's AgentCard has opted into evals, and writes two rows: an eval_traces row capturing I/O telemetry and an eval_runs row tying the trace to the agent card and organization.

How emission works

Emission is wrapped in Next.js after(), which keeps the lambda alive past the HTTP response boundary on serverless runtimes. The session status route's latency budget is never consumed.

Opting in

Emission is off by default. Enable it per AgentCard via the evalConfig field:

# AgentCard YAML (agent-cards/my-agent.yaml)
name: my-sdlc-agent
workType: development
evalConfig:
  enabled: true
  graders:
    - structural/zod-v1
    - model-grader/llm-judge-v1
  regressionThreshold: 0.15   # optional - default 0.15
  driftWindowDays: 7           # optional - default 7

The graders list is an ordered list of grader IDs. Grader job enqueueing runs after emission (configured in the eval runs pipeline). The enabled flag is the only required field - the rest have defaults.

Type

eval_runs

One row per emission event, tied to the AgentCard and org.

Prop

Type

WorkareaSnapshotRef

When evalConfig.enabled=true, the emission hook marks the workarea snapshot as retain='eval-permanent', preventing garbage collection:

interface WorkareaSnapshotRef {
  provider: string        // workarea provider ('e2b', 'vercel', 'local', ...)
  snapshotId: string      // opaque provider-internal ID
  retain: 'default' | 'eval-permanent' | 'evicted'
  capturedAt?: string     // ISO-8601
}

This ensures replay can re-run graders against the exact workarea state.

The `withEvalEmissionHook` decorator

The hook wraps the session status route handler. Stack it alongside other Layer-6 hooks:

// app/api/sessions/[id]/status/route.ts
export const PUT = withEvalEmissionHook(
  withLlmInferenceEmissionHook(
    withSessionStatusHook(baseHandler)
  )
)

Order does not matter - all deferred work runs in after() post-response.

The hook only fires when:

The handler returns a 2xx response.
The request body contains status in TERMINAL_STATUSES = { 'completed', 'failed', 'stopped' }.
The session's AgentCard has evalConfig.enabled = true.

Known limitations

Input/output payload capture is infrastructure. The inputPayload and outputPayload fields on eval_traces are currently null. Full payload capture requires the session state protocol to expose I/O blobs. The inputHash / outputHash fields on eval_runs are populated via SHA-256 of a sentinel string in the interim.
Grader job enqueueing is deferred. The gradeResults array starts empty; graders are run via the grader queue after emission.

Structural Grader - deterministic Zod-schema grader
Model Grader - LLM-as-judge grader
Human Grader - operator review queue grader
Eval Runs Dashboard - operator view of all eval runs with regression detection
Eval Replay - re-run graders against a frozen trace
BFSI Eval Mode - strict mode requirements for regulated environments

How emission works

Opting in

EvalConfig schema

Row shapes

eval_traces

eval_runs

WorkareaSnapshotRef

The `withEvalEmissionHook` decorator

Known limitations

On this page