Entity Resolution
Three-stage graph entity resolution.
Entity resolution prevents duplicate nodes from accumulating in the knowledge graph when the same real-world concept (a module, service, or architectural decision) is extracted from multiple observations independently. The resolver runs as part of the extraction pipeline and returns the canonical node ID to use for upsert.
Three-stage process
Stage 1 - UUID5 deterministic dedup
For every extracted node, generateEntityId(type, { name }) computes a deterministic UUID5 from the node's (type, normalised-name) tuple. If a node with this ID already exists in the store, the resolver short-circuits and returns it immediately - no fuzzy search or LLM call is needed.
This handles the common case where the same entity (e.g. the AuthService module) is referenced in multiple observations. The same inputs always produce the same UUID5, so re-extraction is a no-op at the DB level (upsert).
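The determinism described above can be illustrated with a minimal sketch. It assumes Node.js (`node:crypto`), and the namespace UUID, the `type:name` key format, and the `normaliseName` helper are all hypothetical choices made for illustration; the real generateEntityId may normalise and namespace differently.

```typescript
import { createHash } from "node:crypto";

// Assumption: any fixed namespace works; this one is the RFC 4122 DNS namespace.
const ENTITY_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

// RFC 4122 version-5 UUID: SHA-1 over the namespace bytes followed by the name.
function uuidv5(name: string, namespace: string): string {
  const nsBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  const digest = createHash("sha1").update(nsBytes).update(name, "utf8").digest();
  const bytes = Buffer.from(digest.subarray(0, 16));
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // set the version nibble to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the RFC 4122 variant bits
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Hypothetical normalisation: trim, lowercase, collapse inner whitespace.
function normaliseName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Sketch of generateEntityId: hash the (type, normalised-name) tuple.
function generateEntityId(type: string, fields: { name: string }): string {
  return uuidv5(`${type}:${normaliseName(fields.name)}`, ENTITY_NAMESPACE);
}

const firstId = generateEntityId("module", { name: "AuthService" });
const secondId = generateEntityId("module", { name: "  authservice " });
console.log(firstId === secondId); // true: both normalise to the same tuple
```

Because the ID depends only on the inputs, two observations that mention the same entity resolve to the same row without any lookup beyond an existence check.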
Stage 2 - Embedding similarity
When the UUID5 ID is not yet in the store and an embedding vector is available, the resolver calls findFuzzyDuplicates against the pgvector index with a default similarity threshold of 0.92. The query returns up to 20 existing nodes whose cosine similarity to the extracted node's embedding meets that threshold.
If one or more matches are found, the best match (highest cosine similarity) becomes a merge candidate. Without Stage 3, the resolver trusts the embedding match alone and returns the existing node's ID - the extracted node merges into the existing one.
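As a rough illustration of the Stage 2 selection logic, the sketch below replaces the pgvector query with an in-memory scan. The `StoredNode` and `FuzzyMatch` types and the `findFuzzyDuplicatesInMemory` name are hypothetical; only the 0.92 threshold, the limit of 20, and the "highest cosine similarity wins" rule come from the text above.

```typescript
// Hypothetical shapes for illustration only.
interface StoredNode {
  id: string;
  embedding: number[];
}

interface FuzzyMatch {
  id: string;
  similarity: number;
}

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(u: number[], v: number[]): number {
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  return normU === 0 || normV === 0 ? 0 : dot / Math.sqrt(normU * normV);
}

// In-memory stand-in for the pgvector query: keep nodes at or above the
// threshold, order them best-first, and cap the result at `limit`.
function findFuzzyDuplicatesInMemory(
  nodes: StoredNode[],
  query: number[],
  threshold = 0.92,
  limit = 20,
): FuzzyMatch[] {
  return nodes
    .map((n) => ({ id: n.id, similarity: cosineSimilarity(n.embedding, query) }))
    .filter((m) => m.similarity >= threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}

const stored: StoredNode[] = [
  { id: "node-a", embedding: [1, 0] },
  { id: "node-b", embedding: [0.99, 0.1] },
  { id: "node-c", embedding: [0, 1] },
];

const matches = findFuzzyDuplicatesInMemory(stored, [1, 0]);
const mergeCandidate = matches[0]; // best match, or undefined when nothing qualifies
console.log(mergeCandidate?.id); // "node-a"
```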
Stage 3 - LLM confirmation (optional)
When llm is provided to the resolver and Stage 2 found a fuzzy match, the resolver asks the LLM whether the two entity descriptions refer to the same real-world entity:
Entity A: { name, type, description }
Entity B: { name, type, description }
Are these the same entity? { "same": true|false, "reasoning": "..." }
Failures default to false (conservative - do not merge on uncertainty). When llm is absent, embedding similarity alone determines the merge decision.
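The conservative default can be sketched as follows. The `AskLlm` function type, `buildConfirmationPrompt`, `parseVerdict`, and `confirmSameEntity` are hypothetical names used for illustration, not the LlmExtractor API; the sketch also assumes that an unparseable reply counts as a failure.

```typescript
interface EntityDescription {
  name: string;
  type: string;
  description: string;
}

// Hypothetical stand-in for the LLM client: prompt in, raw text out.
type AskLlm = (prompt: string) => Promise<string>;

function buildConfirmationPrompt(a: EntityDescription, b: EntityDescription): string {
  return [
    `Entity A: ${JSON.stringify(a)}`,
    `Entity B: ${JSON.stringify(b)}`,
    'Are these the same entity? { "same": true|false, "reasoning": "..." }',
  ].join("\n");
}

// Returns true only when the reply is valid JSON with `same` strictly true.
// Anything else (bad JSON, missing field, wrong type) is treated as false.
function parseVerdict(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { same?: unknown }).same === true
    );
  } catch {
    return false;
  }
}

// A rejected LLM call also falls through to false, so uncertainty never merges.
async function confirmSameEntity(
  ask: AskLlm,
  a: EntityDescription,
  b: EntityDescription,
): Promise<boolean> {
  try {
    return parseVerdict(await ask(buildConfirmationPrompt(a, b)));
  } catch {
    return false;
  }
}
```

Checking for strict `true` rather than truthiness means a reply such as `{ "same": "yes" }` does not trigger a merge.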
API
interface ResolveEntityOptions {
  store: IGraphStore
  node: ExtractedNode        // extracted name + type + description
  embedding?: number[]       // optional, enables Stage 2
  llm?: LlmExtractor         // optional, enables Stage 3
  scope: { orgId: string; projectId: string }
  fuzzyThreshold?: number    // default 0.92
}
interface ResolveResult {
  resolvedId: string         // canonical node ID to use for upsert
  isMerge: boolean           // true when merged with an existing node
  mergeWith?: string         // existing node ID (only when isMerge = true)
}
const result = await resolveEntity(opts)
// result.resolvedId is always the ID to pass to store.upsertNode(...)
Merge semantics
When isMerge = true, the pipeline upserts the extracted content onto the existing node's UUID. The merge is last-write-wins on description and properties; feedbackWeight and importanceWeight accumulate from the original row via the upsert's ON CONFLICT DO UPDATE clause.
No separate merge tracking table exists. Each upsert silently enriches the canonical node rather than forking it.
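One possible reading of these semantics, expressed as a pure function rather than SQL, is sketched below. The `NodeRow` shape and `mergeOnConflict` name are hypothetical, and two points are assumptions rather than documented behaviour: that "accumulate" means the stored weight is added to the incoming one, and that properties are replaced wholesale rather than deep-merged.

```typescript
// Hypothetical row shape for illustration.
interface NodeRow {
  id: string;
  description: string;
  properties: Record<string, unknown>;
  feedbackWeight: number;
  importanceWeight: number;
}

// Mirrors what an ON CONFLICT DO UPDATE clause might do for a merge.
function mergeOnConflict(existing: NodeRow, incoming: NodeRow): NodeRow {
  return {
    id: existing.id, // the canonical node keeps its UUID
    description: incoming.description, // last write wins
    properties: incoming.properties, // last write wins (assumed: replaced, not deep-merged)
    // Assumption: weights are summed with the original row's values.
    feedbackWeight: existing.feedbackWeight + incoming.feedbackWeight,
    importanceWeight: existing.importanceWeight + incoming.importanceWeight,
  };
}

const merged = mergeOnConflict(
  { id: "n1", description: "old", properties: { lang: "ts" }, feedbackWeight: 2, importanceWeight: 1 },
  { id: "n2", description: "new", properties: { owner: "auth" }, feedbackWeight: 1, importanceWeight: 3 },
);
console.log(merged.id, merged.description, merged.feedbackWeight, merged.importanceWeight);
// n1 new 3 4
```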
Threshold tuning
The default fuzzy threshold of 0.92 is conservative. At this similarity, nodes must be nearly identical in embedding space before they merge. Lowering the threshold (e.g. to 0.85) captures more merges but risks conflating distinct entities - particularly for short names like config or utils. The threshold is a per-call parameter; you can pass a different value when constructing the extraction pipeline for a specific observation type.
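To make the trade-off concrete, the sketch below applies two thresholds to the same similarity score. The observation type names and the `THRESHOLD_OVERRIDES` map are hypothetical; only the 0.92 default and the 0.85 example value come from the text above.

```typescript
const DEFAULT_FUZZY_THRESHOLD = 0.92;

// Hypothetical per-observation-type overrides.
const THRESHOLD_OVERRIDES: Record<string, number> = {
  "code-review": 0.85,
};

// Falls back to the default when no override is configured for the type.
function thresholdFor(observationType: string): number {
  return THRESHOLD_OVERRIDES[observationType] ?? DEFAULT_FUZZY_THRESHOLD;
}

function wouldMerge(similarity: number, threshold: number): boolean {
  return similarity >= threshold;
}

const candidateSimilarity = 0.9;
console.log(wouldMerge(candidateSimilarity, thresholdFor("incident-report"))); // false
console.log(wouldMerge(candidateSimilarity, thresholdFor("code-review"))); // true
```

A pair scoring 0.9 stays separate at the default but merges under the lower override, which is exactly the range where short, generic names are most likely to be conflated.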
Degenerate cases
| Scenario | Outcome |
|---|---|
| No embedding, UUID5 miss | New node inserted at canonical UUID5 ID |
| Embedding available, no matches above threshold | New node inserted |
| Embedding match found, llm absent | Merge into best match |
| Embedding match found, llm says false | New node inserted at canonical UUID5 ID |
| resolveEntity throws | Extraction pipeline logs warning, returns zero entities for this observation |
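The outcomes in the table can be sketched as a single decision function. This is a simplified, synchronous model with hypothetical names (`SketchInput`, `decide`, `extractSafely`): the stage results are passed in as plain values instead of being computed, and returning `isMerge: false` on a Stage 1 hit is an assumption, since the table does not cover that case.

```typescript
interface ResolveResult {
  resolvedId: string;
  isMerge: boolean;
  mergeWith?: string;
}

// Hypothetical summary of what each stage found.
interface SketchInput {
  canonicalId: string; // UUID5 computed in Stage 1
  existsInStore: boolean; // did Stage 1 find that ID?
  bestMatchId?: string; // Stage 2 best match above threshold, if any
  llmVerdict?: boolean; // Stage 3 answer; undefined when llm is absent
}

function decide(input: SketchInput): ResolveResult {
  // Stage 1 hit: short-circuit (assumption: not reported as a merge).
  if (input.existsInStore) {
    return { resolvedId: input.canonicalId, isMerge: false };
  }
  // No embedding, or nothing above the threshold: insert a new node.
  if (input.bestMatchId === undefined) {
    return { resolvedId: input.canonicalId, isMerge: false };
  }
  // The LLM rejected the match: insert a new node at the canonical ID.
  if (input.llmVerdict === false) {
    return { resolvedId: input.canonicalId, isMerge: false };
  }
  // llm absent, or the LLM confirmed: merge into the best match.
  return { resolvedId: input.bestMatchId, isMerge: true, mergeWith: input.bestMatchId };
}

// Pipeline-side guard: a throwing resolver yields zero entities plus a warning.
function extractSafely<T>(resolveAll: () => T[]): T[] {
  try {
    return resolveAll();
  } catch (err) {
    console.warn("entity resolution failed; skipping observation", err);
    return [];
  }
}

const outcome = decide({ canonicalId: "uuid5-x", existsInStore: false, bestMatchId: "node-7" });
console.log(outcome.isMerge, outcome.resolvedId); // true node-7
```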
Related pages
- Extraction Pipeline - orchestrates strategy → resolve → store
- Knowledge Graph Store - pgvector backend and IGraphStore API