Model Grader
Secondary LLM rubric grader.
The model grader dispatches a secondary LLM inference call (a "judge call") to score an agent's output against a rubric you supply. Use it for semantic quality questions that cannot be expressed as a Zod schema - code correctness, tone, completeness, business rule adherence - where you need a human-readable score and explanation.
Grader ID
model-grader/llm-judge-v1How it works
The judge model is called at temperature: 0 with a structured JSON response format. It receives your rubric - hydrated with the agent's actual input and output - and is instructed to return only:
{ "score": 0.82, "explanation": "One-paragraph reasoning." }Constructor
new ModelGrader(
rubric: string,
modelProfileId: string,
threshold?: number, // default 0.7
judgeOpts?: Partial<JudgeInferenceOpts>
)| Parameter | Type | Default | Notes |
|---|---|---|---|
rubric | string | required | Rubric prompt template - supports {{input}} and {{output}} substitution |
modelProfileId | string | required | Platform model profile ID used to resolve the judge model |
threshold | number | 0.7 | Minimum score for pass: true |
judgeOpts | object? | - | Explicit { model, baseUrl, apiKey } overrides (bypasses profile resolution; useful for admin replay) |
Writing a rubric
The rubric is a prompt template. Use {{input}} and {{output}} as placeholders:
const rubric = `
You are evaluating a code review agent.
Agent input (the PR diff):
{{input}}
Agent output (the review):
{{output}}
Score 1.0 if all of the following are true:
- The review identifies at least one specific code concern.
- The review clearly states whether it approves or requests changes.
- The tone is professional and constructive.
Score 0.0 if the output is empty, irrelevant, or unprofessional.
Use the full 0.0-1.0 range for partial credit.
`At evaluation time, {{input}} and {{output}} are replaced with JSON-serialized versions of ctx.input and ctx.output. Plain strings are inserted without extra quoting.
Model resolution
The judge model is resolved in this priority order:
judgeOpts.model- explicit override (admin replay, testing)- Platform profile cascade:
getProfileById(modelProfileId)→ model catalog lookup →"provider/modelId"string LLM_JUDGE_MODELenvironment variable- Hard-coded fallback:
anthropic/claude-3-5-haiku
The resolved model ID is returned in GradeResult.metadata.judgeModel for audit purposes.
Usage
import { ModelGrader } from '@/lib/evals/graders/model-grader'
const grader = new ModelGrader(
`Agent output:\n{{output}}\n\nDoes it correctly summarize the PR in at least 2 sentences? Score 1.0=yes, 0.0=no.`,
'my-org-model-profile-id',
0.7,
)
const result = await grader.evaluate({
input: { prTitle: 'Add Cedar policy for A2A dispatch' },
output: { summary: 'This PR adds a Cedar policy that gates A2A task dispatch...' },
traceRef: 'evt_abc123',
})
// result:
// {
// graderId: 'model-grader/llm-judge-v1',
// score: 0.9,
// pass: true,
// reasoning: 'The summary clearly identifies the PR's purpose...',
// metadata: { judgeModel: 'anthropic/claude-3-5-haiku-20241022' }
// }GradeResult schema
Prop
Type
JudgeInferenceOpts
When bypassing profile resolution (admin replay, CI fixtures):
interface JudgeInferenceOpts {
model: string // e.g. 'anthropic/claude-3-5-haiku'
baseUrl?: string // default: LLM_BASE_URL env or 'https://openrouter.ai/api/v1'
apiKey?: string // default: LLM_API_KEY or OPENROUTER_API_KEY env
}Performance characteristics
| Property | Value |
|---|---|
| Mode | async |
| LLM call | Yes - one completion per evaluate() call |
| Latency | 1-5 s typical (depends on judge model) |
| Temperature | 0 (deterministic, reproducible) |
| Max tokens | 512 |
| Response format | json_object |
The model grader is non-blocking in the eval pipeline - grader runs happen asynchronously after emission. Do not use the model grader as a synchronous gate in the dispatch hot path.
Error handling
The grader never rejects. On any error (API failure, malformed JSON, rate limit), it returns:
{
"graderId": "model-grader/llm-judge-v1",
"score": 0,
"pass": false,
"reasoning": "ModelGrader evaluate failed: <error message>",
"metadata": { "error": true }
}This ensures pipeline safety - a failed model grader does not block the eval run.
Registering in an AgentCard
evalConfig:
enabled: true
graders:
- structural/zod-v1 # runs first - zero cost
- model-grader/llm-judge-v1 # runs after structural passesFor high-volume agents, consider running the structural grader alone first and only triggering the model grader when the structural check passes.
Related pages
- Structural Grader - deterministic zero-cost grader; run before model grader
- Human Grader - operator review for cases where LLM judgment is insufficient
- Eval Emission - how graders are triggered after session completion
- Eval Replay - re-run with a different model profile without re-running the session
- BFSI Eval Mode - compliance requirements for regulated environments