Entity Resolution

Entity resolution prevents duplicate nodes from accumulating in the knowledge graph when the same real-world concept (a module, service, or architectural decision) is extracted from multiple observations independently. The resolver runs as part of the extraction pipeline and returns the canonical node ID to use for upsert.

Three-stage process

Stage 1 - UUID5 deterministic dedup

For every extracted node, generateEntityId(type, { name }) computes a deterministic UUID5 from the node's (type, normalised-name) tuple. If a node with this ID already exists in the store, the resolver short-circuits and returns it immediately - no fuzzy search or LLM call is needed.

This handles the common case where the same entity (e.g. the AuthService module) is referenced in multiple observations. The same inputs always produce the same UUID5, so re-extraction upserts onto the existing row instead of creating a duplicate.
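The deterministic ID can be illustrated with a minimal sketch. It assumes Node's built-in crypto module, a made-up normalisation rule (trim, lowercase, collapse whitespace), and the RFC 4122 DNS namespace as a stand-in; the real generateEntityId may normalise names and pick its namespace differently. The names uuidV5, normaliseName, and sketchEntityId are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Stand-in namespace (RFC 4122 DNS namespace). The project's own namespace is not documented here.
const SKETCH_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

// UUID version 5: SHA-1 over namespace bytes + name, truncated to 16 bytes,
// with the version and variant bits overwritten.
function uuidV5(name: string, namespace: string): string {
  const nsBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  const digest = createHash("sha1").update(nsBytes).update(name, "utf8").digest();
  const bytes = Buffer.from(digest.subarray(0, 16));
  bytes.writeUInt8((bytes.readUInt8(6) & 0x0f) | 0x50, 6); // version 5
  bytes.writeUInt8((bytes.readUInt8(8) & 0x3f) | 0x80, 8); // RFC 4122 variant
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Assumed normalisation: trim, lowercase, collapse internal whitespace.
function normaliseName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hypothetical counterpart to generateEntityId(type, { name }).
function sketchEntityId(type: string, name: string): string {
  return uuidV5(`${type}:${normaliseName(name)}`, SKETCH_NAMESPACE);
}
```

Under these assumptions, sketchEntityId("module", "AuthService") and sketchEntityId("module", "  authservice ") yield the same ID, while changing the type yields a different one.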

Stage 2 - Embedding similarity

When the UUID5 ID is not yet in the store and an embedding vector is available, the resolver queries findFuzzyDuplicates against the pgvector index with a default similarity threshold of 0.92. The query returns up to 20 existing nodes whose cosine similarity to the extracted node's embedding is at or above that threshold.

If one or more matches are found, the best match (highest cosine similarity) becomes a merge candidate. Without Stage 3, the resolver trusts the embedding match alone and returns the existing node's ID - the extracted node merges into the existing one.
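The selection logic of Stage 2 can be sketched in memory, without pgvector. This is an illustration only: the Candidate shape and the functions cosineSimilarity and findBestMatch are hypothetical, and the real findFuzzyDuplicates runs as a database query rather than a linear scan.

```typescript
// Hypothetical in-memory stand-in for a stored node with an embedding.
interface Candidate {
  id: string;
  embedding: number[];
}

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);

// Cosine similarity in [-1, 1]; returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  const denom = Math.sqrt(dot(a, a) * dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom;
}

// Keep candidates at or above the threshold, cap the list at `limit`,
// and return the most similar one (undefined when nothing qualifies).
function findBestMatch(
  query: number[],
  candidates: Candidate[],
  threshold = 0.92,
  limit = 20,
): { id: string; similarity: number } | undefined {
  const matches = candidates
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(query, c.embedding) }))
    .filter((m) => m.similarity >= threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
  return matches[0];
}
```

pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity), so a similarity threshold of 0.92 corresponds to a distance of at most 0.08.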

Stage 3 - LLM confirmation (optional)

When llm is provided to the resolver and Stage 2 found a fuzzy match, the resolver asks the LLM whether the two entity descriptions refer to the same real-world entity:

Entity A: { name, type, description }
Entity B: { name, type, description }

Are these the same entity? { "same": true|false, "reasoning": "..." }

If the LLM confirmation fails, the answer defaults to false (conservative - do not merge on uncertainty), and the extracted node is inserted as a new node. When llm is absent, embedding similarity alone determines the merge decision.
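The conservative default can be sketched as follows. The AskLlm callback is a hypothetical stand-in for whatever LlmExtractor exposes, and the prompt wording simply mirrors the template above; the sketch treats a thrown error, invalid JSON, or any value other than a literal true as "not the same entity".

```typescript
interface EntityDescription {
  name: string;
  type: string;
  description: string;
}

// Hypothetical LLM interface: takes a prompt, resolves to the raw reply text.
type AskLlm = (prompt: string) => Promise<string>;

function buildPrompt(a: EntityDescription, b: EntityDescription): string {
  return [
    `Entity A: ${JSON.stringify(a)}`,
    `Entity B: ${JSON.stringify(b)}`,
    "",
    'Are these the same entity? { "same": true|false, "reasoning": "..." }',
  ].join("\n");
}

// Returns true only when the reply parses as JSON with same === true.
// Every failure path (call error, bad JSON, wrong shape) yields false.
async function confirmSameEntity(
  ask: AskLlm,
  a: EntityDescription,
  b: EntityDescription,
): Promise<boolean> {
  try {
    const parsed: unknown = JSON.parse(await ask(buildPrompt(a, b)));
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { same?: unknown }).same === true
    );
  } catch {
    return false;
  }
}
```

Requiring a strict boolean true means replies such as "yes" or 1 do not trigger a merge, which keeps the failure mode on the side of creating a separate node.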

API

interface ResolveEntityOptions {
  store: IGraphStore
  node: ExtractedNode           // extracted name + type + description
  embedding?: number[]          // optional, enables Stage 2
  llm?: LlmExtractor            // optional, enables Stage 3
  scope: { orgId: string; projectId: string }
  fuzzyThreshold?: number       // default 0.92
}

interface ResolveResult {
  resolvedId: string            // canonical node ID to use for upsert
  isMerge: boolean              // true when merged with an existing node
  mergeWith?: string            // existing node ID (only when isMerge = true)
}

const result = await resolveEntity(opts)
// result.resolvedId is always the ID to pass to store.upsertNode(...)

Merge semantics

When isMerge = true, the pipeline upserts the extracted content onto the existing node's UUID. The merge is last-write-wins on description and properties; feedbackWeight and importanceWeight accumulate from the original row via the upsert's ON CONFLICT DO UPDATE clause.

No separate merge tracking table exists. Each upsert silently enriches the canonical node rather than forking it.

Threshold tuning

The default fuzzy threshold of 0.92 is conservative: nodes must be nearly identical in embedding space before they merge. Lowering the threshold (e.g. to 0.85) produces more merges but risks conflating distinct entities - particularly for short, generic names like config or utils. The threshold is a per-call parameter (fuzzyThreshold), so the extraction pipeline can pass a different value for a specific observation type.

Degenerate cases

Scenario	Outcome
No embedding, UUID5 miss	New node inserted at canonical UUID5 ID
Embedding available, no matches above threshold	New node inserted at canonical UUID5 ID
Embedding match found, `llm` absent	Merge into best match
Embedding match found, `llm` says `false`	New node inserted at canonical UUID5 ID
`resolveEntity` throws	Extraction pipeline logs warning, returns zero entities for this observation

Extraction Pipeline - orchestrates strategy → resolve → store
Knowledge Graph Store - pgvector backend and IGraphStore API
