Memory Diagnostics
Session diagnostics and downweighting.
Memory diagnostics records a structured trace for every session that touches injected observations and, for failed sessions, applies aggressive downweighting to observations that are consistently misleading. This closes the feedback loop between session outcomes and the retrieval weights that determine which observations agents see in the future.
SessionDiagnostic
A SessionDiagnostic row is written to session_diagnostics at the end of every session. It captures:
interface SessionDiagnostic {
sessionId: string
outcome: SessionOutcome // 'accepted' | 'rejected' | 'rework'
injectedObservationIds: string[] // observations delivered at session start
injectedGraphEntityIds: string[] // graph node UUIDs surfaced to agent
injectedGraphEdgeKeys?: Array<{ // graph edges - used by GraphFeedbackService
sourceId: string
targetId: string
relationshipName: string
}>
toolCallSummary: { tool: string; file?: string; success: boolean }[]
filesModified: string[]
errorsEncountered: { message: string; tool: string; recoverable: boolean }[]
failureCategory?: FailureCategory
failureContext?: string
helpfulObservationIds?: string[]
misleadingObservationIds?: string[]
orgId?: string
projectId?: string
}
For accepted sessions a lightweight diagnostic is written (no tool/error detail). For rejected or rework sessions a full diagnostic is written, including a deterministically-classified failure category.
Failure categories
Failure categories are computed by classifyFailure with no LLM call, using simple heuristics evaluated in priority order:
| Category | Rule |
|---|---|
| test_failure | Any test-runner tool (jest, vitest, pytest, go_test, etc.) returned success: false |
| style | Any error message matches /lint\|format\|prettier\|eslint/i |
| regression | Any error message matches /regression\|broke\|previously/i |
| incomplete | filesModified.length === 0 |
| wrong_approach | errorsEncountered.length > 0 (with none of the above) |
| other | None of the above matched |
These categories power the failure-category distribution query used in the Performance Analytics views.
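The priority-ordered rules in the table can be sketched roughly as follows. This is a minimal illustration, not the actual classifyFailure implementation: the function name classifyFailureSketch, the FailureInput type, and the exact set of test-runner tool names are assumptions (the source only lists "jest, vitest, pytest, go_test, etc.").

```typescript
type FailureCategory =
  | 'test_failure' | 'style' | 'regression'
  | 'incomplete' | 'wrong_approach' | 'other'

interface ToolCall { tool: string; file?: string; success: boolean }
interface SessionError { message: string; tool: string; recoverable: boolean }

// Hypothetical input shape: the subset of SessionDiagnostic the rules read.
interface FailureInput {
  toolCallSummary: ToolCall[]
  filesModified: string[]
  errorsEncountered: SessionError[]
}

// Assumed list of test-runner tool names; the real list may be longer.
const TEST_RUNNER = /^(jest|vitest|pytest|go_test)$/

function classifyFailureSketch(input: FailureInput): FailureCategory {
  // Rules are checked top to bottom; the first match wins.
  if (input.toolCallSummary.some((c) => TEST_RUNNER.test(c.tool) && !c.success)) {
    return 'test_failure'
  }
  if (input.errorsEncountered.some((e) => /lint|format|prettier|eslint/i.test(e.message))) {
    return 'style'
  }
  if (input.errorsEncountered.some((e) => /regression|broke|previously/i.test(e.message))) {
    return 'regression'
  }
  if (input.filesModified.length === 0) return 'incomplete'
  if (input.errorsEncountered.length > 0) return 'wrong_approach'
  return 'other'
}

// A failing test run outranks a lint error in the same session.
const example = classifyFailureSketch({
  toolCallSummary: [{ tool: 'vitest', file: 'src/foo.test.ts', success: false }],
  filesModified: ['src/foo.ts'],
  errorsEncountered: [{ message: 'eslint: no-unused-vars', tool: 'eslint', recoverable: true }],
})
// example === 'test_failure'
```

Because the first matching rule wins, a session can satisfy several rules yet report only the highest-priority category.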
Misleading observation detection
An observation is considered misleading when it has appeared in 3 or more failed sessions (rejected or rework outcome) and zero successful sessions (accepted outcome). This pattern indicates the observation is actively leading agents astray.
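To make the rule concrete, here is a small in-memory sketch of the same counting logic. The function findMisleadingInMemory and the DiagnosticRow type are hypothetical and exist only for illustration; the real detection runs as a database query.

```typescript
type SessionOutcome = 'accepted' | 'rejected' | 'rework'

// Hypothetical row shape: only the two fields the rule needs.
interface DiagnosticRow {
  outcome: SessionOutcome
  injectedObservationIds: string[]
}

function findMisleadingInMemory(rows: DiagnosticRow[], minFailures = 3): string[] {
  const counts: Record<string, { fail: number; success: number }> = {}
  for (const row of rows) {
    for (const obsId of row.injectedObservationIds) {
      const c = counts[obsId] ?? { fail: 0, success: 0 }
      // 'rejected' and 'rework' both count as failures.
      if (row.outcome === 'accepted') c.success += 1
      else c.fail += 1
      counts[obsId] = c
    }
  }
  // Keep observations with enough failures and no successes at all.
  return Object.keys(counts).filter(
    (obsId) => counts[obsId].fail >= minFailures && counts[obsId].success === 0,
  )
}

const sampleRows: DiagnosticRow[] = [
  { outcome: 'rejected', injectedObservationIds: ['obs_1', 'obs_2'] },
  { outcome: 'rework', injectedObservationIds: ['obs_1', 'obs_2'] },
  { outcome: 'rejected', injectedObservationIds: ['obs_1', 'obs_2'] },
  { outcome: 'accepted', injectedObservationIds: ['obs_2'] },
]
const flagged = findMisleadingInMemory(sampleRows)
// flagged contains only 'obs_1': obs_2 also failed three times,
// but its single accepted session clears it.
```

A single accepted session is enough to exempt an observation, so the rule only flags observations that have never been present in a success.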
const misleading = await findMisleadingObservations({
orgId: 'org_123',
since: new Date('2026-01-01'),
})
// Returns string[] of observation UUIDs
The detection query runs a cross-session aggregation against session_diagnostics by expanding the injected_observation_ids JSONB array:
WITH expanded AS (
SELECT obs_id, outcome
FROM session_diagnostics,
jsonb_array_elements_text(injected_observation_ids) AS obs_id
),
counts AS (
SELECT obs_id,
COUNT(*) FILTER (WHERE outcome IN ('rejected', 'rework')) AS fail_count,
COUNT(*) FILTER (WHERE outcome = 'accepted') AS success_count
FROM expanded
GROUP BY obs_id
)
SELECT obs_id FROM counts
WHERE fail_count >= 3 AND success_count = 0
Aggressive 2x downweight
When misleading observations are detected, applyMisleadingObservationDownweight applies an EMA update with double the normal failure alpha:
newWeight = EMA(previousWeight, 'rejected', alphaFailure × 2)
This means misleading observations sink in the ranking much faster than ordinary failed-session observations. The aggressive weight update is:
- Written to observations.weight.
- Logged to observation_weight_history with outcome = 'rejected' and the doubled alpha.
- Emitted as a memory.feedback audit event with metadata noting the aggressive downweight.
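As a worked illustration of what doubling the alpha does, the sketch below assumes a conventional EMA that moves the weight toward 1 for accepted sessions and toward 0 otherwise. The source does not spell out the EMA formula, and the alphaFailure value of 0.1 is hypothetical, so treat both as assumptions.

```typescript
type SessionOutcome = 'accepted' | 'rejected' | 'rework'

// Assumed EMA form: newWeight = (1 - alpha) * previousWeight + alpha * target,
// with target = 1 for 'accepted' and 0 for any failed outcome.
function ema(previousWeight: number, outcome: SessionOutcome, alpha: number): number {
  const target = outcome === 'accepted' ? 1 : 0
  // Clamp so that a doubled alpha can never exceed 1.
  const a = Math.min(Math.max(alpha, 0), 1)
  return (1 - a) * previousWeight + a * target
}

const alphaFailure = 0.1 // hypothetical value, not taken from the source
const previousWeight = 0.8

// Ordinary failed-session update: roughly 0.72.
const normalWeight = ema(previousWeight, 'rejected', alphaFailure)
// Aggressive update with the doubled alpha: roughly 0.64.
const aggressiveWeight = ema(previousWeight, 'rejected', alphaFailure * 2)
```

Under these assumptions a single aggressive update removes twice as much weight as an ordinary one (0.16 versus 0.08 here).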
recordSessionDiagnostic - primary entry point
Call recordSessionDiagnostic once at session completion. It handles the full sequence automatically:
import { recordSessionDiagnostic } from '@/lib/memory/diagnostics'
const { id, aggressiveDownweightApplied } = await recordSessionDiagnostic({
sessionId: 'ses_abc',
outcome: 'rejected',
injectedObservationIds: ['obs_1', 'obs_2', 'obs_3'],
injectedGraphEntityIds: ['node_x'],
injectedGraphEdgeKeys: [],
toolCallSummary: [
{ tool: 'vitest', file: 'src/foo.test.ts', success: false },
],
filesModified: ['src/foo.ts'],
errorsEncountered: [
{ message: '2 tests failed', tool: 'vitest', recoverable: false },
],
orgId: 'org_123',
projectId: 'proj_xyz',
})
console.log(`diagnostic id: ${id}`)
console.log(`observations aggressively downweighted: ${aggressiveDownweightApplied}`)
The function is best-effort - all DB writes are wrapped in try/catch and never propagate exceptions back to the session completion path.
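The best-effort behaviour can be pictured with a small wrapper like the one below. This is a sketch of the general pattern only: bestEffort and completeSessionSketch are hypothetical names, and the real module may structure its try/catch blocks differently.

```typescript
// Hypothetical helper: run a write, swallow any error, return a fallback.
async function bestEffort<T>(
  label: string,
  write: () => Promise<T>,
  fallback: T,
): Promise<T> {
  try {
    return await write()
  } catch (err) {
    // Log and continue; the caller never sees the exception.
    console.warn(`[diagnostics] ${label} failed; continuing`, err)
    return fallback
  }
}

// Simulated session completion in which the diagnostic write fails.
async function completeSessionSketch(): Promise<{ id: string | null; completed: boolean }> {
  const id = await bestEffort<string | null>(
    'insert session_diagnostics row',
    async () => {
      throw new Error('db unavailable')
    },
    null,
  )
  // Session completion proceeds even though the write threw.
  return { id, completed: true }
}
```

The trade-off of this pattern is that a failed write is visible only in logs, so callers that need the diagnostic id should handle a missing value.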
Failure-category distribution
failureCategoryDistribution provides an aggregated breakdown for analytics:
import { failureCategoryDistribution } from '@/lib/memory/diagnostics'
const dist = await failureCategoryDistribution({
orgId: 'org_123',
projectId: 'proj_xyz',
since: new Date('2026-05-01'),
})
// {
// test_failure: 12,
// style: 3,
// regression: 1,
// incomplete: 0,
// wrong_approach: 7,
// other: 2,
// unknown: 0,
// }
This powers the failure breakdown chart in the performance analytics dashboard.
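For charting, the raw counts usually need to be turned into proportions. A minimal sketch, assuming the result is a plain object of counts keyed by category as in the example above (toShares is a hypothetical helper, not part of the diagnostics module):

```typescript
// Convert per-category counts into fractions of the total.
function toShares(dist: Record<string, number>): Record<string, number> {
  const keys = Object.keys(dist)
  const total = keys.reduce((sum, k) => sum + dist[k], 0)
  const shares: Record<string, number> = {}
  for (const k of keys) {
    // Guard against division by zero when no failures were recorded.
    shares[k] = total === 0 ? 0 : dist[k] / total
  }
  return shares
}

// Counts copied from the example distribution above (25 failures in total).
const exampleDist: Record<string, number> = {
  test_failure: 12,
  style: 3,
  regression: 1,
  incomplete: 0,
  wrong_approach: 7,
  other: 2,
  unknown: 0,
}
const shares = toShares(exampleDist)
// shares.test_failure is 12 / 25 = 0.48; shares.wrong_approach is 7 / 25 = 0.28
```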
Graph edge feedback
injectedGraphEdgeKeys is populated from context_injection_logs.graph_edge_keys at session start. At session completion, the diagnostics module feeds these edge keys to GraphFeedbackService.applyFeedback so knowledge graph edge weights are updated to reflect whether the surfaced architectural knowledge was helpful or misleading.
Related pages
- Feedback Retrieval - the EMA weight updates these diagnostics feed
- Retrieval Ranking - the ranking that is improved by accurate weights
- Context Injection - where injectedObservationIds originate
- Feedback Retention Audit - the observation_weight_history table and audit trail
- Agent Reliability - the analytics surface for failure categories