
Secure RAG MVP — building a document search system that doesn't hallucinate

Everyone is building RAG systems right now. Most of them have the same problem: the AI sounds confident and is wrong.

I wanted to understand the problem properly, so I built one from scratch — no LangChain, no framework magic, just the components and the decisions they force you to make.


What it does

You upload PDFs or text files. You ask questions in natural language. You get answers — with citations, page numbers, and exact quotes from the source material.

The constraint that shapes everything: the system refuses to answer from memory. If the information is not in the retrieved documents, the response is "information not available". No fabrication.

This sounds simple. Making it work reliably across different document types is not.
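The refusal constraint lives mostly in the prompt. This is a hypothetical sketch of the grounding instruction and prompt assembly — the exact wording and helper names (`SYSTEM_PROMPT`, `build_prompt`) are illustrative, not taken from the repo:

```python
# Sketch of a grounding prompt. The key constraint: answer only from the
# provided context, cite sources, and refuse explicitly when the context
# is silent. The repository's actual wording may differ.
SYSTEM_PROMPT = """\
You are a document QA assistant. Answer ONLY from the context below.
Every claim must cite its source chunk, e.g. [1].
If the context does not contain the answer, reply exactly:
"Information not available in the provided documents."
Never answer from your own knowledge."""


def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble the final prompt from retrieved chunks and the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```

The "reply exactly" phrasing matters: a fixed refusal string is easy to detect in evaluation, so refusals can be scored automatically instead of eyeballed.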


The retrieval problem

The core challenge in RAG is not generation — GPT-4o-mini is good at producing coherent text. The challenge is retrieval: finding the right chunks of source material before the model ever sees the question.

Semantic search (vector similarity) alone is not enough. Here is a concrete example.

Query: FV/2025/01/0847 (an invoice number lookup)

With pure semantic search, relevance score: 0.338. The correct chunk was not retrieved. The answer: "No information in documents."

With hybrid search, relevance score: 0.885. Correct chunk ranked first. Answer: "Invoice number is FV/2025/01/0847" with citation.

Semantic search finds passages that are conceptually similar to the question. It understands meaning. But it misses exact codes, reference numbers, and technical identifiers — the things that matter most in document QA. BM25 keyword search catches those. The hybrid approach combines both:

# 70% semantic weight, 30% keyword weight
combined_score = 0.7 * semantic_score + 0.3 * bm25_score

The weights came from evaluation against a test set, not from intuition.
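The one-liner above glosses over one practical detail: cosine similarities and raw BM25 scores live on different scales, so they have to be normalized before mixing. A minimal sketch, assuming min-max normalization per query (the repo may normalize differently):

```python
def minmax(scores: dict[int, float]) -> dict[int, float]:
    """Scale raw per-chunk scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(semantic: dict[int, float], bm25: dict[int, float],
                alpha: float = 0.7) -> list[int]:
    """Rank chunk ids by alpha * semantic + (1 - alpha) * keyword score."""
    sem, kw = minmax(semantic), minmax(bm25)
    combined = {c: alpha * sem.get(c, 0.0) + (1 - alpha) * kw.get(c, 0.0)
                for c in set(sem) | set(kw)}
    return sorted(combined, key=combined.get, reverse=True)
```

Note that a chunk retrieved by only one of the two searches still gets ranked — it just scores zero on the missing signal.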


The numbers

I evaluated against 30 questions across three document types: an invoice, a technical manual, and a contract.

Metric               Semantic only    Hybrid search
Context retrieval    67% (20/30)      83% (25/30)
Citation accuracy    60% (18/30)      70% (21/30)
Exact match score    0.338            0.885

The exact match improvement is where BM25 earns its weight. Codes, amounts, dates — semantic search misses them, keyword search finds them.
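Why does keyword search win on identifiers? BM25's IDF term rewards rare tokens, and a string like an invoice number is about as rare as tokens get. The project presumably uses a library implementation; this self-contained Okapi BM25 sketch just shows the mechanics:

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25: score each doc against the query. Exact, rare terms
    (codes, reference numbers) get a large IDF boost."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Run it on the invoice example and the chunk containing the literal token scores well above everything else — exactly the behavior pure vector similarity failed to produce.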


The architecture

Three design decisions that shaped everything:

Local embeddings. sentence-transformers (all-MiniLM-L6-v2) generates 384-dimensional vectors locally. No API call, no per-token cost, no network dependency on the critical path. OpenAI embeddings are available as a fallback, but the local model is the default.

Chunk overlap. Documents split into 2000-character chunks with 300-character overlap. I started at 1000 characters — too small, context was lost across chunks. 2000 characters with overlap keeps answers from falling between the gaps.
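The overlap logic is small enough to show in full. A sketch of fixed-size chunking with overlap (the repo's implementation may split on sentence boundaries or otherwise differ):

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 300) -> list[str]:
    """Split text into `size`-character chunks. Each chunk repeats the last
    `overlap` characters of its predecessor, so a fact that straddles a
    boundary survives intact in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, consecutive chunks share a 300-character window; that shared window is what keeps answers from "falling between the gaps".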

No LangChain. A deliberate choice. LangChain is excellent for moving fast and prototyping. For this project I wanted to understand each layer — chunking strategy, ranking algorithm, prompt structure — without a framework abstracting the decisions away. When retrieval fails, you need to know exactly why. When you own every layer, you do.

The full pipeline:

  1. PDF/text extraction (PyPDF + pdfplumber for complex layouts)
  2. Chunking with overlap
  3. Embedding generation — local, sentence-transformers
  4. Storage in PostgreSQL with pgvector extension
  5. At query time: hybrid retrieval (BM25 + vector) → context assembly → GPT-4o-mini → answer with citations
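For the vector half of step 5, pgvector makes retrieval a plain SQL query. A hypothetical query assuming a `chunks` table with an `embedding vector(384)` column — the repo's actual schema and parameter names may differ:

```python
# Hypothetical SQL for the semantic half of retrieval. `<=>` is pgvector's
# cosine distance operator, so (1 - distance) recovers cosine similarity.
# Table and column names are assumptions, not taken from the repo.
VECTOR_QUERY = """
SELECT id,
       content,
       1 - (embedding <=> %(query_vec)s) AS semantic_score
FROM chunks
ORDER BY embedding <=> %(query_vec)s
LIMIT %(k)s;
"""
```

The BM25 pass runs separately (e.g. over the same chunks in Python or via PostgreSQL full-text search), and the two score lists are merged by the weighted combination shown earlier.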

Stack: Python · FastAPI · PostgreSQL · pgvector · sentence-transformers · OpenAI GPT-4o-mini · Docker · Poetry


The evaluation framework

I built a scoring system before measuring anything: 30 questions, three criteria per question (correctness, citation quality, completeness), 0–2 points each. 180 points maximum.

This forced two things. First, I had to write the questions before seeing the results — no unconscious selection of easy cases. Second, when I changed the retrieval algorithm, I had a number. Not a feeling that it was better. A number.

That discipline — build the measurement before you run the experiment — comes from QA. It applies here exactly as it does in software testing.
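The rubric is simple enough to state as code. A sketch with hypothetical field names — the criteria and point range are from the text above, the dict keys are mine:

```python
def score_run(results: list[dict[str, int]]) -> tuple[int, int]:
    """Score one evaluation run. Each result carries 0-2 points per
    criterion: correctness, citation quality, completeness.
    Returns (points earned, maximum possible)."""
    criteria = ("correctness", "citation", "completeness")
    for r in results:
        assert all(0 <= r[c] <= 2 for c in criteria), "points must be 0-2"
    earned = sum(r[c] for r in results for c in criteria)
    return earned, len(results) * len(criteria) * 2
```

Thirty questions, three criteria, two points each: the maximum comes out to 180, matching the ceiling above, and every retrieval change produces a single comparable number.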


What's next

The project is open source: github.com/andrzejoblong/secure-rag-mvp

What I want to explore next: streaming responses, LangGraph for multi-step reasoning, and deployment to AWS EKS. The foundation is solid enough that those are extensions, not rewrites.

If you are building something with RAG — document QA, knowledge bases, anything where citation accuracy matters — and want to talk through the architecture, get in touch.
