Model Grader

The model grader dispatches a secondary LLM inference call (a "judge call") to score an agent's output against a rubric you supply. Use it for semantic quality questions that cannot be expressed as a Zod schema - code correctness, tone, completeness, business rule adherence - where you need a human-readable score and explanation.

Grader ID

model-grader/llm-judge-v1

How it works

The judge model is called at temperature: 0 with a structured JSON response format. It receives your rubric - hydrated with the agent's actual input and output - and is instructed to return only:

{ "score": 0.82, "explanation": "One-paragraph reasoning." }

Constructor

new ModelGrader(
  rubric: string,
  modelProfileId: string,
  threshold?: number,         // default 0.7
  judgeOpts?: Partial<JudgeInferenceOpts>
)

Parameter	Type	Default	Notes
`rubric`	`string`	required	Rubric prompt template - supports `{{input}}` and `{{output}}` substitution
`modelProfileId`	`string`	required	Platform model profile ID used to resolve the judge model
`threshold`	`number`	`0.7`	Minimum `score` for `pass: true`
`judgeOpts`	`object?`	-	Explicit `{ model, baseUrl, apiKey }` overrides (bypasses profile resolution; useful for admin replay)

Writing a rubric

The rubric is a prompt template. Use {{input}} and {{output}} as placeholders:

const rubric = `
You are evaluating a code review agent.

Agent input (the PR diff):
{{input}}

Agent output (the review):
{{output}}

Score 1.0 if all of the following are true:
- The review identifies at least one specific code concern.
- The review clearly states whether it approves or requests changes.
- The tone is professional and constructive.

Score 0.0 if the output is empty, irrelevant, or unprofessional.

Use the full 0.0-1.0 range for partial credit.
`

At evaluation time, {{input}} and {{output}} are replaced with JSON-serialized versions of ctx.input and ctx.output. Plain strings are inserted without extra quoting.

Model resolution

The judge model is resolved in this priority order:

judgeOpts.model - explicit override (admin replay, testing)
Platform profile cascade: getProfileById(modelProfileId) → model catalog lookup → "provider/modelId" string
LLM_JUDGE_MODEL environment variable
Hard-coded fallback: anthropic/claude-3-5-haiku

The resolved model ID is returned in GradeResult.metadata.judgeModel for audit purposes.

Usage

import { ModelGrader } from '@/lib/evals/graders/model-grader'

const grader = new ModelGrader(
  `Agent output:\n{{output}}\n\nDoes it correctly summarize the PR in at least 2 sentences? Score 1.0=yes, 0.0=no.`,
  'my-org-model-profile-id',
  0.7,
)

const result = await grader.evaluate({
  input: { prTitle: 'Add Cedar policy for A2A dispatch' },
  output: { summary: 'This PR adds a Cedar policy that gates A2A task dispatch...' },
  traceRef: 'evt_abc123',
})

// result:
// {
//   graderId: 'model-grader/llm-judge-v1',
//   score: 0.9,
//   pass: true,
//   reasoning: 'The summary clearly identifies the PR's purpose...',
//   metadata: { judgeModel: 'anthropic/claude-3-5-haiku-20241022' }
// }

GradeResult schema

Prop

Type

JudgeInferenceOpts

When bypassing profile resolution (admin replay, CI fixtures):

interface JudgeInferenceOpts {
  model: string       // e.g. 'anthropic/claude-3-5-haiku'
  baseUrl?: string    // default: LLM_BASE_URL env or 'https://openrouter.ai/api/v1'
  apiKey?: string     // default: LLM_API_KEY or OPENROUTER_API_KEY env
}

Performance characteristics

Property	Value
Mode	`async`
LLM call	Yes - one completion per `evaluate()` call
Latency	1-5 s typical (depends on judge model)
Temperature	0 (deterministic, reproducible)
Max tokens	512
Response format	`json_object`

The model grader is non-blocking in the eval pipeline - grader runs happen asynchronously after emission. Do not use the model grader as a synchronous gate in the dispatch hot path.

Error handling

The grader never rejects. On any error (API failure, malformed JSON, rate limit), it returns:

{
  "graderId": "model-grader/llm-judge-v1",
  "score": 0,
  "pass": false,
  "reasoning": "ModelGrader evaluate failed: <error message>",
  "metadata": { "error": true }
}

This ensures pipeline safety - a failed model grader does not block the eval run.

Registering in an AgentCard

evalConfig:
  enabled: true
  graders:
    - structural/zod-v1            # runs first - zero cost
    - model-grader/llm-judge-v1   # runs after structural passes

For high-volume agents, consider running the structural grader alone first and only triggering the model grader when the structural check passes.

Structural Grader - deterministic zero-cost grader; run before model grader
Human Grader - operator review for cases where LLM judgment is insufficient
Eval Emission - how graders are triggered after session completion
Eval Replay - re-run with a different model profile without re-running the session
BFSI Eval Mode - compliance requirements for regulated environments

On this page