Human Grader

The human grader defers scoring to an operator. When it runs, it immediately returns a pending placeholder result and places the eval trace in a review queue. The eval run stays in a running-like state until an operator fulfills the grade via the admin UI. This is the required grader for acceptance-critical sessions in BFSI strict mode.

Grader ID

human-grader/v1

How it works

HumanGrader.evaluate() does not inspect the agent output at all. It:

Generates a unique queueId (hgq_<12-byte-hex>).
Records the requestedAt ISO timestamp.
Returns a GradeResult with score: 0, pass: false, reasoning: 'pending', and metadata.pending: true.

The pending entry is stored in eval_runs.grade_results (JSONB array). The operator reviews and grades it via PATCH /api/evals/runs/:id/grade, which updates the grade result array in-place with the final score.

Constructor

new HumanGrader(label?: string)

Parameter	Type	Default	Notes
`label`	`string?`	-	Display name shown in the admin review queue (e.g. `'QA acceptance gate'`)

import { HumanGrader } from '@/lib/evals/graders/human-grader'

const grader = new HumanGrader('Acceptance review - critical path')

const result = await grader.evaluate(ctx)

// result:
// {
//   graderId: 'human-grader/v1',
//   score: 0,
//   pass: false,
//   reasoning: 'pending',
//   metadata: {
//     pending: true,
//     queueId: 'hgq_3f7a9c12e45b',
//     requestedAt: '2026-06-02T14:31:00.000Z',
//     label: 'Acceptance review - critical path'
//   }
// }

GradeResult schema (pending state)

Prop

Type

Operator workflow

Navigate to Admin → Evals (/admin/evals).

Filter by Pending human review to see runs awaiting operator input.

Open a run. The detail view shows the trace (session ID, token counts, tool calls) alongside the pending grade entry.

Submit a score (0.0-1.0) and a written explanation. The platform updates the grade_results array in-place and sets pending: false.

Unfulfilled human-grader entries keep the aggregate score for the eval run low (because the placeholder score: 0 is averaged in). Complete pending reviews promptly to get accurate run-level scores and regression detection.

No new database table

The human grader stores its pending entry in the existing eval_runs.grade_results JSONB array - no separate review-queue table is needed. The queueId within metadata is the correlation key for the PATCH endpoint.

Performance characteristics

Property	Value
Mode	`async`
LLM call	None
Latency	Microseconds (`evaluate()` itself is near-zero)
Fulfillment latency	Operator-dependent

The grader itself is fast; the blocking factor is operator response time.

BFSI strict mode

In BFSI strict mode, human grader items for acceptance-phase sessions are routed to the org's review queue and must be fulfilled before the eval run can be marked complete. This is the acceptance-critical path for SR 11-7 compliance artifacts.

Registering in an AgentCard

evalConfig:
  enabled: true
  graders:
    - structural/zod-v1          # fast structural check first
    - human-grader/v1            # operator review for every completed session

Or with a label for the queue UI:

// Programmatic registration when building the grader pipeline
new HumanGrader('Acceptance gate - production deploy session')

Eval Emission - how graders are queued after session completion
Structural Grader - run before human grader to catch schema failures cheaply
Eval Runs Dashboard - where pending human reviews appear
BFSI Eval Mode - compliance requirements that mandate human graders

On this page