Memory Diagnostics

Memory diagnostics records a structured trace for every session that touches injected observations and, for failed sessions, applies aggressive downweighting to observations that are consistently misleading. This closes the feedback loop between session outcomes and the retrieval weights that determine which observations agents see in the future.

SessionDiagnostic

A SessionDiagnostic row is written to session_diagnostics at the end of every session. It captures:

interface SessionDiagnostic {
  sessionId: string
  outcome: SessionOutcome              // 'accepted' | 'rejected' | 'rework'
  injectedObservationIds: string[]     // observations delivered at session start
  injectedGraphEntityIds: string[]     // graph node UUIDs surfaced to agent
  injectedGraphEdgeKeys?: Array<{      // graph edges - used by GraphFeedbackService
    sourceId: string
    targetId: string
    relationshipName: string
  }>
  toolCallSummary: { tool: string; file?: string; success: boolean }[]
  filesModified: string[]
  errorsEncountered: { message: string; tool: string; recoverable: boolean }[]
  failureCategory?: FailureCategory
  failureContext?: string
  helpfulObservationIds?: string[]
  misleadingObservationIds?: string[]
  orgId?: string
  projectId?: string
}

For accepted sessions, a lightweight diagnostic is written (no tool or error detail). For rejected or rework sessions, a full diagnostic is written, including a deterministically classified failure category.
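The split between lightweight and full diagnostics could look roughly like the sketch below. The helper name toStoredDiagnostic and the trimmed DiagnosticInput type are hypothetical, and the choice to keep filesModified on accepted sessions is an assumption; the source only says that tool and error detail is omitted.

```typescript
type SessionOutcome = 'accepted' | 'rejected' | 'rework'

interface ToolCall { tool: string; file?: string; success: boolean }
interface SessionError { message: string; tool: string; recoverable: boolean }

// Trimmed-down, hypothetical input shape (a subset of SessionDiagnostic).
interface DiagnosticInput {
  sessionId: string
  outcome: SessionOutcome
  injectedObservationIds: string[]
  toolCallSummary: ToolCall[]
  filesModified: string[]
  errorsEncountered: SessionError[]
}

// Hypothetical helper: accepted sessions drop tool/error detail,
// failed sessions (rejected or rework) are stored in full.
function toStoredDiagnostic(input: DiagnosticInput): DiagnosticInput {
  if (input.outcome === 'accepted') {
    return { ...input, toolCallSummary: [], errorsEncountered: [] }
  }
  return input
}
```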

Failure categories

Failure categories are computed by classifyFailure with no LLM call, using simple heuristics evaluated in priority order:

Category	Rule
`test_failure`	Any test-runner tool (`jest`, `vitest`, `pytest`, `go_test`, etc.) returned `success: false`
`style`	Any error message matches `/lint|format|prettier|eslint/i`
`regression`	Any error message matches `/regression|broke|previously/i`
`incomplete`	`filesModified.length === 0`
`wrong_approach`	`errorsEncountered.length > 0` (with none of the above)
`other`	None of the above matched

These categories power the failure-category distribution query used in the Performance Analytics views.
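The priority-ordered rules in the table can be sketched as a single pure function. This is an illustration and not the actual classifyFailure source: the name classifyFailureSketch, the ClassifyInput type, and the TEST_RUNNERS set are assumptions (the table lists four runners followed by "etc.", so the real set is likely larger).

```typescript
type FailureCategory =
  | 'test_failure'
  | 'style'
  | 'regression'
  | 'incomplete'
  | 'wrong_approach'
  | 'other'

interface ToolCall { tool: string; file?: string; success: boolean }
interface SessionError { message: string; tool: string; recoverable: boolean }

interface ClassifyInput {
  toolCallSummary: ToolCall[]
  filesModified: string[]
  errorsEncountered: SessionError[]
}

// Assumption: only the runners named in the table; the real list may be longer.
const TEST_RUNNERS = new Set(['jest', 'vitest', 'pytest', 'go_test'])

// Rules are checked top to bottom; the first match wins.
function classifyFailureSketch(d: ClassifyInput): FailureCategory {
  if (d.toolCallSummary.some((c) => TEST_RUNNERS.has(c.tool) && !c.success)) {
    return 'test_failure'
  }
  if (d.errorsEncountered.some((e) => /lint|format|prettier|eslint/i.test(e.message))) {
    return 'style'
  }
  if (d.errorsEncountered.some((e) => /regression|broke|previously/i.test(e.message))) {
    return 'regression'
  }
  if (d.filesModified.length === 0) return 'incomplete'
  if (d.errorsEncountered.length > 0) return 'wrong_approach'
  return 'other'
}
```

Because the checks run in priority order, a session with both a failed test run and a lint error is classified as test_failure, not style.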

Misleading observation detection

An observation is considered misleading when it has appeared in 3 or more failed sessions (rejected or rework outcome) and zero successful sessions (accepted outcome). This pattern indicates the observation is actively leading agents astray.

const misleading = await findMisleadingObservations({
  orgId: 'org_123',
  since: new Date('2026-01-01'),
})
// Returns string[] of observation UUIDs

The detection query runs a cross-session aggregation against session_diagnostics by expanding the injected_observation_ids JSONB array into one row per observation. The query as shown here does not include the orgId and since filters:

WITH expanded AS (
  SELECT obs_id, outcome
  FROM session_diagnostics,
       jsonb_array_elements_text(injected_observation_ids) AS obs_id
),
counts AS (
  SELECT obs_id,
         COUNT(*) FILTER (WHERE outcome IN ('rejected', 'rework')) AS fail_count,
         COUNT(*) FILTER (WHERE outcome = 'accepted') AS success_count
  FROM expanded
  GROUP BY obs_id
)
SELECT obs_id FROM counts
WHERE fail_count >= 3 AND success_count = 0
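
For unit tests or small data sets, the same aggregation can be expressed in memory. The sketch below mirrors the query's two counts and its final filter; the function name findMisleadingInMemory and the DiagnosticRow shape are hypothetical and not part of the module's API.

```typescript
type SessionOutcome = 'accepted' | 'rejected' | 'rework'

// Hypothetical in-memory row: just the two columns the query reads.
interface DiagnosticRow {
  outcome: SessionOutcome
  injectedObservationIds: string[]
}

function findMisleadingInMemory(rows: DiagnosticRow[], minFailures = 3): string[] {
  const counts = new Map<string, { fail: number; success: number }>()

  // Equivalent of the "expanded" CTE: one count update per (row, observation) pair.
  for (const row of rows) {
    for (const obsId of row.injectedObservationIds) {
      const c = counts.get(obsId) ?? { fail: 0, success: 0 }
      if (row.outcome === 'accepted') c.success += 1
      else c.fail += 1 // 'rejected' and 'rework' both count as failures
      counts.set(obsId, c)
    }
  }

  // Equivalent of: WHERE fail_count >= 3 AND success_count = 0
  return Array.from(counts.entries())
    .filter(([, c]) => c.fail >= minFailures && c.success === 0)
    .map(([obsId]) => obsId)
}
```

A single accepted session is enough to exclude an observation, no matter how many failed sessions it has appeared in.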

Aggressive 2x downweight

When misleading observations are detected, applyMisleadingObservationDownweight applies an EMA update with double the normal failure alpha:

newWeight = EMA(previousWeight, 'rejected', alphaFailure × 2)

Because the smoothing factor is doubled, each update moves a misleading observation's weight roughly twice as far as an ordinary failed-session update, so these observations sink in the ranking faster. The aggressive weight update is:

Written to observations.weight.
Logged to observation_weight_history with outcome = 'rejected' and the doubled alpha.
Emitted as a memory.feedback audit event with metadata noting the aggressive downweight.
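
To make the effect of the doubled alpha concrete, here is a minimal sketch under stated assumptions. The EMA form (a weighted step toward a target), the failure target of 0, and the ALPHA_FAILURE value of 0.2 are all assumptions for illustration; the source does not give the actual formula or constants.

```typescript
// Assumed EMA form: move the weight toward `target` by a fraction `alpha`.
function emaUpdate(previousWeight: number, target: number, alpha: number): number {
  const a = Math.min(Math.max(alpha, 0), 1) // clamp so a doubled alpha never exceeds 1
  return (1 - a) * previousWeight + a * target
}

const ALPHA_FAILURE = 0.2 // hypothetical value, not taken from the source
const FAILURE_TARGET = 0 // assumption: a 'rejected' outcome pulls the weight toward 0

// Ordinary failed-session update.
function normalFailureUpdate(weight: number): number {
  return emaUpdate(weight, FAILURE_TARGET, ALPHA_FAILURE)
}

// Misleading-observation update: same EMA, double the alpha.
function aggressiveFailureUpdate(weight: number): number {
  return emaUpdate(weight, FAILURE_TARGET, ALPHA_FAILURE * 2)
}
```

With these assumed values, a weight of 0.8 drops to about 0.64 after an ordinary failure update and to about 0.48 after the aggressive one.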

recordSessionDiagnostic - primary entry point

Call recordSessionDiagnostic once at session completion. It handles the full sequence automatically:

import { recordSessionDiagnostic } from '@/lib/memory/diagnostics'

const { id, aggressiveDownweightApplied } = await recordSessionDiagnostic({
  sessionId: 'ses_abc',
  outcome: 'rejected',
  injectedObservationIds: ['obs_1', 'obs_2', 'obs_3'],
  injectedGraphEntityIds: ['node_x'],
  injectedGraphEdgeKeys: [],
  toolCallSummary: [
    { tool: 'vitest', file: 'src/foo.test.ts', success: false },
  ],
  filesModified: ['src/foo.ts'],
  errorsEncountered: [
    { message: '2 tests failed', tool: 'vitest', recoverable: false },
  ],
  orgId: 'org_123',
  projectId: 'proj_xyz',
})

console.log(`diagnostic id: ${id}`)
console.log(`observations aggressively downweighted: ${aggressiveDownweightApplied}`)

The function is best-effort: all DB writes are wrapped in try/catch, so a failed write never propagates an exception back to the session completion path.
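
One way to get this best-effort behavior is a small wrapper around each write. The sketch below is an illustration only; the bestEffort helper, its parameters, and the log prefix are hypothetical and do not describe how the module is actually structured.

```typescript
// Hypothetical helper: run an async write, log any failure, and return a
// fallback value instead of letting the exception reach the caller.
async function bestEffort<T>(
  label: string,
  fallback: T,
  write: () => Promise<T>,
): Promise<T> {
  try {
    return await write()
  } catch (err) {
    console.warn(`[diagnostics] ${label} failed`, err)
    return fallback
  }
}
```

A caller on the session completion path can then await each write without its own try/catch, and a database outage degrades to a logged warning plus the fallback value.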

Failure-category distribution

failureCategoryDistribution provides an aggregated breakdown for analytics:

import { failureCategoryDistribution } from '@/lib/memory/diagnostics'

const dist = await failureCategoryDistribution({
  orgId: 'org_123',
  projectId: 'proj_xyz',
  since: new Date('2026-05-01'),
})
// {
//   test_failure: 12,
//   style: 3,
//   regression: 1,
//   incomplete: 0,
//   wrong_approach: 7,
//   other: 2,
//   unknown: 0,
// }

This powers the failure breakdown chart in the performance analytics dashboard.
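
The shape of the result can be reproduced with a simple tally over diagnostic rows. This sketch is not the real query: the function name tallyFailureCategories is hypothetical, and counting rows with no stored failureCategory under unknown is an assumption about what that key means.

```typescript
type FailureCategory =
  | 'test_failure'
  | 'style'
  | 'regression'
  | 'incomplete'
  | 'wrong_approach'
  | 'other'

type DistributionKey = FailureCategory | 'unknown'

function tallyFailureCategories(
  rows: { failureCategory?: FailureCategory }[],
): Record<DistributionKey, number> {
  // Start every key at zero so categories with no rows still appear in the output.
  const dist: Record<DistributionKey, number> = {
    test_failure: 0,
    style: 0,
    regression: 0,
    incomplete: 0,
    wrong_approach: 0,
    other: 0,
    unknown: 0,
  }
  for (const row of rows) {
    // Assumption: rows without a category are counted as 'unknown'.
    dist[row.failureCategory ?? 'unknown'] += 1
  }
  return dist
}
```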

Graph edge feedback

injectedGraphEdgeKeys is populated from context_injection_logs.graph_edge_keys at session start. At session completion, the diagnostics module feeds these edge keys to GraphFeedbackService.applyFeedback so knowledge graph edge weights are updated to reflect whether the surfaced architectural knowledge was helpful or misleading.

Feedback Retrieval - the EMA weight updates these diagnostics feed
Retrieval Ranking - the ranking that is improved by accurate weights
Context Injection - where injectedObservationIds originate
Feedback Retention Audit - the observation_weight_history table and audit trail
Agent Reliability - the analytics surface for failure categories
