
Secure RAG MVP — building a document search system that doesn't hallucinate

Everyone is building RAG systems right now. Most of them have the same problem: the AI sounds confident and is wrong.

I wanted to understand the problem properly, so I built one from scratch — no LangChain, no framework magic, just the components and the decisions they force you to make.


What it does

You upload PDFs or text files. You ask questions in natural language. You get answers — with citations, page numbers, and exact quotes from the source material.

The constraint that shapes everything: the system refuses to answer from memory. If the information is not in the retrieved documents, the response is "information not available". No fabrication.

This sounds simple. Making it work reliably across different document types is not.
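The refusal constraint lives mostly in the prompt. This is a hypothetical sketch of the grounding instruction and prompt assembly — the exact wording and helper names (`SYSTEM_PROMPT`, `build_prompt`) are illustrative, not taken from the repo:

```python
# Sketch of a grounding prompt. The key constraint: answer only from the
# provided context, cite sources, and refuse explicitly when the context
# is silent. The repository's actual wording may differ.
SYSTEM_PROMPT = """\
You are a document QA assistant. Answer ONLY from the context below.
Every claim must cite its source chunk, e.g. [1].
If the context does not contain the answer, reply exactly:
"Information not available in the provided documents."
Never answer from your own knowledge."""


def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble the final prompt from retrieved chunks and the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```

The "reply exactly" phrasing matters: a fixed refusal string is easy to detect in evaluation, so refusals can be scored automatically instead of eyeballed.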


The retrieval problem

The core challenge in RAG is not generation — GPT-4o-mini is good at producing coherent text. The challenge is retrieval: finding the right chunks of source material before the model ever sees the question.

Semantic search (vector similarity) alone is not enough. Here is a concrete example.

Query: FV/2025/01/0847 (an invoice number lookup)

With pure semantic search, relevance score: 0.338. The correct chunk was not retrieved. The answer: "No information in documents."

With hybrid search, relevance score: 0.885. Correct chunk ranked first. Answer: "Invoice number is FV/2025/01/0847" with citation.

Semantic search finds passages that are conceptually similar to the question. It understands meaning. But it misses exact codes, reference numbers, and technical identifiers — the things that matter most in document QA. BM25 keyword search catches those. The hybrid approach combines both:

# 70% semantic weight, 30% keyword weight
combined_score = 0.7 * semantic_score + 0.3 * bm25_score

The weights came from evaluation against a test set, not from intuition.
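The one-liner above glosses over one practical detail: cosine similarities and raw BM25 scores live on different scales, so they have to be normalized before mixing. A minimal sketch, assuming min-max normalization per query (the repo may normalize differently):

```python
def minmax(scores: dict[int, float]) -> dict[int, float]:
    """Scale raw per-chunk scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(semantic: dict[int, float], bm25: dict[int, float],
                alpha: float = 0.7) -> list[int]:
    """Rank chunk ids by alpha * semantic + (1 - alpha) * keyword score."""
    sem, kw = minmax(semantic), minmax(bm25)
    combined = {c: alpha * sem.get(c, 0.0) + (1 - alpha) * kw.get(c, 0.0)
                for c in set(sem) | set(kw)}
    return sorted(combined, key=combined.get, reverse=True)
```

Note that a chunk retrieved by only one of the two searches still gets ranked — it just scores zero on the missing signal.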


The numbers

I evaluated against 30 questions across three document types: an invoice, a technical manual, and a contract.

Metric               Semantic only    Hybrid search
Context retrieval    67% (20/30)      83% (25/30)
Citation accuracy    60% (18/30)      70% (21/30)
Exact match score    0.338            0.885

The exact match improvement is where BM25 earns its weight. Codes, amounts, dates — semantic search misses them, keyword search finds them.
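Why does keyword search win on identifiers? BM25's IDF term rewards rare tokens, and a string like an invoice number is about as rare as tokens get. The project presumably uses a library implementation; this self-contained Okapi BM25 sketch just shows the mechanics:

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25: score each doc against the query. Exact, rare terms
    (codes, reference numbers) get a large IDF boost."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Run it on the invoice example and the chunk containing the literal token scores well above everything else — exactly the behavior pure vector similarity failed to produce.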


The architecture

Three design decisions that shaped everything:

Local embeddings. sentence-transformers (all-MiniLM-L6-v2) generates 384-dimensional vectors locally. No API call, no per-token cost, no network dependency on the critical path. OpenAI embeddings are available as a fallback, but the local model is the default.

Chunk overlap. Documents split into 2000-character chunks with 300-character overlap. I started at 1000 characters — too small, context was lost across chunks. 2000 characters with overlap keeps answers from falling between the gaps.
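The overlap logic is small enough to show in full. A sketch of fixed-size chunking with overlap (the repo's implementation may split on sentence boundaries or otherwise differ):

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 300) -> list[str]:
    """Split text into `size`-character chunks. Each chunk repeats the last
    `overlap` characters of its predecessor, so a fact that straddles a
    boundary survives intact in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, consecutive chunks share a 300-character window; that shared window is what keeps answers from "falling between the gaps".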

No LangChain. A deliberate choice. LangChain is excellent for moving fast and prototyping. For this project I wanted to understand each layer — chunking strategy, ranking algorithm, prompt structure — without a framework abstracting the decisions away. When retrieval fails, you need to know exactly why. When you own every layer, you do.

The full pipeline:

  1. PDF/text extraction (PyPDF + pdfplumber for complex layouts)
  2. Chunking with overlap
  3. Embedding generation — local, sentence-transformers
  4. Storage in PostgreSQL with pgvector extension
  5. At query time: hybrid retrieval (BM25 + vector) → context assembly → GPT-4o-mini → answer with citations
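For the vector half of step 5, pgvector makes retrieval a plain SQL query. A hypothetical query assuming a `chunks` table with an `embedding vector(384)` column — the repo's actual schema and parameter names may differ:

```python
# Hypothetical SQL for the semantic half of retrieval. `<=>` is pgvector's
# cosine distance operator, so (1 - distance) recovers cosine similarity.
# Table and column names are assumptions, not taken from the repo.
VECTOR_QUERY = """
SELECT id,
       content,
       1 - (embedding <=> %(query_vec)s) AS semantic_score
FROM chunks
ORDER BY embedding <=> %(query_vec)s
LIMIT %(k)s;
"""
```

The BM25 pass runs separately (e.g. over the same chunks in Python or via PostgreSQL full-text search), and the two score lists are merged by the weighted combination shown earlier.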

Stack: Python · FastAPI · PostgreSQL · pgvector · sentence-transformers · OpenAI GPT-4o-mini · Docker · Poetry


The evaluation framework

I built a scoring system before measuring anything: 30 questions, three criteria per question (correctness, citation quality, completeness), 0–2 points each. 180 points maximum.

This forced two things. First, I had to write the questions before seeing the results — no unconscious selection of easy cases. Second, when I changed the retrieval algorithm, I had a number. Not a feeling that it was better. A number.

That discipline — build the measurement before you run the experiment — comes from QA. It applies here exactly as it does in software testing.
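The rubric is simple enough to state as code. A sketch with hypothetical field names — the criteria and point range are from the text above, the dict keys are mine:

```python
def score_run(results: list[dict[str, int]]) -> tuple[int, int]:
    """Score one evaluation run. Each result carries 0-2 points per
    criterion: correctness, citation quality, completeness.
    Returns (points earned, maximum possible)."""
    criteria = ("correctness", "citation", "completeness")
    for r in results:
        assert all(0 <= r[c] <= 2 for c in criteria), "points must be 0-2"
    earned = sum(r[c] for r in results for c in criteria)
    return earned, len(results) * len(criteria) * 2
```

Thirty questions, three criteria, two points each: the maximum comes out to 180, matching the ceiling above, and every retrieval change produces a single comparable number.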


What's next

The project is open source: github.com/andrzejoblong/secure-rag-mvp

What I want to explore next: streaming responses, LangGraph for multi-step reasoning, and deployment to AWS EKS. The foundation is solid enough that those are extensions, not rewrites.

If you are building something with RAG — document QA, knowledge bases, anything where citation accuracy matters — and want to talk through the architecture, get in touch.
