Enterprise Rag With Deterministic Citations Using Postgresql Trigram Search

The problem I ran into: “answers” with citations that couldn’t be trusted

I built a small Enterprise RAG (Retrieval-Augmented Generation) pipeline to answer questions over internal documents. It worked great—until I tried to verify citations.

Same question, same documents, but different reruns produced different “source” links. Not because the answer content was wildly different, but because the retrieval step returned slightly different chunks, which then fed the generator.

For real enterprise usage, that kind of nondeterminism is painful: it makes audits, reviews, and incident investigations harder than they should be.

What I wanted was a retrieval layer that is deterministic and produces stable citations, even when I’m using fuzzy matching.

So I explored a niche approach: deterministic chunk matching + stable ordering using PostgreSQL trigram search (pg_trgm), then storing citation metadata with an explicit ordering key.

Key idea: stable retrieval ordering before generation

Enterprise RAG typically follows this loop:

Retrieve relevant text chunks from a document store (often via embeddings or keyword search).
Generate an answer by feeding the retrieved chunks into a language model.

The critical failure mode I hit was step 1: retrieval returning a different set or order of chunks across runs.

My fix was to make retrieval deterministic by combining:

pg_trgm for fuzzy matching (useful when users misspell or paraphrase terms)
a strict tie-breaking sort key (a deterministic ordering column)
bounded selection (take top K after deterministic ordering)
citation IDs tied to chunk rows (so citations are stable even if the text changes later)

Setup: install and initialize PostgreSQL trigram support

1) Enable `pg_trgm`

CREATE EXTENSION IF NOT EXISTS pg_trgm;

2) Create a table for chunks and deterministic ordering

I store each chunk as a row with:

chunk_id: stable primary key
doc_id: document identifier
chunk_index: deterministic order within a doc (e.g., chunk 0, 1, 2…)
text: the chunk content
created_at: used only as metadata (not for ordering)
search_fingerprint: a normalized token bag used to reduce ambiguity (optional but helpful)

CREATE TABLE IF NOT EXISTS rag_chunks (
  chunk_id        BIGSERIAL PRIMARY KEY,
  doc_id          TEXT NOT NULL,
  chunk_index    INT NOT NULL,
  text            TEXT NOT NULL,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  -- A deterministic fingerprint to reduce “equal score” instability
  -- (Here: a simple normalization; production systems might use a better one.)
  search_fingerprint TEXT NOT NULL
);

CREATE INDEX IF NOT EXISTS rag_chunks_doc_id_idx
  ON rag_chunks(doc_id);

-- Trigram index for fuzzy matching on text
CREATE INDEX IF NOT EXISTS rag_chunks_text_trgm_idx
  ON rag_chunks USING GIN (text gin_trgm_ops);

3) Insert some example chunks

INSERT INTO rag_chunks (doc_id, chunk_index, text, search_fingerprint)
VALUES
('HR-POLICY-001', 0, 'Vacation accrual is based on hours worked.', 'vacation accrual based on hours worked'),
('HR-POLICY-001', 1, 'Requests must be submitted at least two weeks in advance.', 'requests must be submitted at least two weeks in advance'),
('SECURITY-101', 0, 'All production systems require MFA for administrative access.', 'all production systems require mfa for administrative access'),
('SECURITY-101', 1, 'Service accounts must use least-privilege permissions.', 'service accounts must use least-privilege permissions');

The deterministic retrieval query (the important part)

Here’s the retrieval SQL I used. It performs fuzzy matching with similarity() and then sorts deterministically:

Primary sort: similarity(text, query) descending
Secondary sort: chunk_index ascending
Tertiary sort: chunk_id ascending (absolute tie-breaker)

WITH ranked AS (
  SELECT
    chunk_id,
    doc_id,
    chunk_index,
    text,
    similarity(text, $1) AS score
  FROM rag_chunks
  WHERE similarity(text, $1) > 0.1
)
SELECT
  chunk_id,
  doc_id,
  chunk_index,
  text,
  score
FROM ranked
ORDER BY
  score DESC,
  chunk_index ASC,
  chunk_id ASC
LIMIT $2;

Why this is deterministic

In practice, similarity scores can tie (or be extremely close). Without a tie-breaker, databases are allowed to return tied rows in effectively arbitrary plans/orderings.

By adding:

chunk_index ASC
chunk_id ASC

…I force a stable ordering whenever scores are equal.

That means I can rerun the same retrieval and get the exact same citation set and ordering.

End-to-end Python example: query → retrieve → generate with stable citations

Below is a complete, working script using:

psycopg for PostgreSQL
a simple deterministic retrieval query
a placeholder “generator” step that shows how citations are assembled

Note: I’m not calling an external LLM here to keep the example runnable offline. The focus is retrieval determinism and citation stability.

import os
import psycopg
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    chunk_id: int
    doc_id: str
    chunk_index: int
    text: str
    score: float

def retrieve_chunks(conn, query: str, k: int) -> List[Chunk]:
    sql = """
    WITH ranked AS (
      SELECT
        chunk_id,
        doc_id,
        chunk_index,
        text,
        similarity(text, %(q)s) AS score
      FROM rag_chunks
      WHERE similarity(text, %(q)s) > 0.1
    )
    SELECT
      chunk_id,
      doc_id,
      chunk_index,
      text,
      score
    FROM ranked
    ORDER BY
      score DESC,
      chunk_index ASC,
      chunk_id ASC
    LIMIT %(k)s;
    """

    with conn.cursor() as cur:
        cur.execute(sql, {"q": query, "k": k})
        rows = cur.fetchall()

    return [
        Chunk(
            chunk_id=row[0],
            doc_id=row[1],
            chunk_index=row[2],
            text=row[3],
            score=float(row[4]),
        )
        for row in rows
    ]

def generate_answer_with_citations(query: str, chunks: List[Chunk]) -> str:
    """
    This is a simple “generator” for demonstration.
    It formats a response using the retrieved chunks and stable citation IDs.
    """
    context_lines = []
    for i, c in enumerate(chunks, start=1):
        citation_id = f"{c.doc_id}#chunk{c.chunk_index}"
        context_lines.append(f"[{i}] {citation_id}: {c.text}")

    # Deterministic formatting ensures the same citations appear in the same order.
    answer = (
        f"Question: {query}\n\n"
        f"Retrieved context (deterministic order):\n"
        + "\n".join(context_lines)
        + "\n\n"
        "Based on the retrieved context above, here is the consolidated answer:\n"
        "- This response is assembled from the listed chunks with stable citations.\n"
    )
    return answer

def main():
    dsn = os.environ.get("DATABASE_URL")
    if not dsn:
        raise RuntimeError("Set DATABASE_URL to your PostgreSQL connection string")

    query = "how much vacation do employees get and when should they request it"
    k = 2

    with psycopg.connect(dsn) as conn:
        chunks_run_1 = retrieve_chunks(conn, query, k)
        chunks_run_2 = retrieve_chunks(conn, query, k)

        # Prove determinism by comparing chunk IDs in order
        ids_1 = [(c.chunk_id, c.doc_id, c.chunk_index, c.score) for c in chunks_run_1]
        ids_2 = [(c.chunk_id, c.doc_id, c.chunk_index, c.score) for c in chunks_run_2]

        assert ids_1 == ids_2, "Retrieval order was not deterministic!"

        answer = generate_answer_with_citations(query, chunks_run_1)
        print(answer)

        print("Stable citation IDs in order:")
        for c in chunks_run_1:
            print(f"- {c.doc_id}#chunk{c.chunk_index} (chunk_id={c.chunk_id}, score={c.score:.4f})")

if __name__ == "__main__":
    main()

What happens when I ran it

First retrieval run returned the same top chunks, in the same order.
Second retrieval run returned identical (chunk_id, doc_id, chunk_index) tuples.
The generated response included citations in that same deterministic order.

The practical outcome: even though fuzzy matching is involved, the citation list is stable across reruns.

Where this fits in a real Enterprise RAG stack

This trigram approach is useful when:

you want fuzzy keyword matching (users type like humans)
embeddings are too expensive for every retrieval
you need deterministic ordering for auditability
your documents are fairly “chunkable” and chunk boundaries are fixed

In many stacks, I’ve seen teams combine both:

embeddings retrieval for semantic similarity
trigram retrieval as a tie-breaker or fallback
deterministic ordering across both pathways

But the core lesson remains: citations must be anchored to stable chunk rows and retrieved in a strictly deterministic order.

Conclusion

I learned that the most annoying RAG failures aren’t always about the language model—they’re often about retrieval nondeterminism. By using PostgreSQL pg_trgm with a carefully designed, deterministic ORDER BY (similarity score plus explicit tie-breakers like chunk_index and chunk_id), I made enterprise retrieval produce stable, trustworthy citations across reruns. This small retrieval discipline turned “RAG answers” into something closer to auditable, repeatable system behavior.