Building a Local RAG System for Private Document Interaction
The appeal of a local RAG system is simple: you get LLM-quality answers over your private documents without a single byte leaving your machine. No API keys, no cloud costs, no data exposure.
This post walks through a complete implementation using Ollama (llama3) for embeddings and generation, LangChain for document loading and chunking, and ChromaDB for local vector storage.
Architecture
PDF files
│
▼
LangChain loader → text chunks
│
▼
Ollama embeddings → ChromaDB (persistent, local)
│
▼
query → embed → semantic search → top-k chunks → llama3 → answer
Three components do most of the work: a document loader that handles the messy reality of PDFs, a vector store that persists across restarts, and a query interface that ties them together.
Loading and chunking PDFs
Unstructured PDFs are the hardest input format: they can contain images, tables, multi-column layouts, and arbitrary whitespace. LangChain's PyPDFLoader handles the extraction; RecursiveCharacterTextSplitter handles the chunking.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_pdf(path: str) -> list[str]:
    # Load one Document per page, then split into overlapping character-based chunks.
    loader = PyPDFLoader(path)
    pages = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "],
    )
    chunks = splitter.split_documents(pages)
    return [c.page_content for c in chunks]

The separator order matters. The splitter tries each separator in sequence, falling back to the next if the chunk is still too large: paragraph breaks first, then newlines, then sentence boundaries.
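To see how those parameters behave before ingesting real documents, split_text gives a quick sanity check on a raw string. A minimal sketch; sample.txt is a placeholder for any plain-text file:

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],
)
sample = open("sample.txt").read()  # placeholder file used only as a probe
for i, chunk in enumerate(splitter.split_text(sample)):
    print(i, len(chunk), repr(chunk[:60]))

Each chunk should come back at or under 500 characters, with roughly 50 characters shared between neighboring chunks.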
Vector storage with ChromaDB
ChromaDB gives you a persistent local vector store with a single dependency. One critical detail: the embedding function you use at collection creation must match the one you use at query time. Mismatched functions produce silently wrong results.
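One way to make that failure loud rather than silent is to stamp the collection with the model that produced its vectors and check the stamp on every startup. A sketch that could sit right after the get_or_create_collection call in the manager below; the embedding_model metadata key is my own convention, not something ChromaDB defines:

EXPECTED_MODEL = "llama3"

stored = (self.collection.metadata or {}).get("embedding_model")
if stored is None:
    # First run: record which model the stored vectors came from.
    self.collection.modify(metadata={"embedding_model": EXPECTED_MODEL})
elif stored != EXPECTED_MODEL:
    raise RuntimeError(
        f"Collection was embedded with {stored!r}, not {EXPECTED_MODEL!r}"
    )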
import chromadb
from chromadb.utils import embedding_functions

CHROMA_PATH = "./chroma_store"
COLLECTION_NAME = "documents"

class ChromaDBManager:
    def __init__(self) -> None:
        # PersistentClient keeps the index on disk under CHROMA_PATH.
        self.client = chromadb.PersistentClient(path=CHROMA_PATH)
        # The same Ollama-backed embedding function serves both ingestion
        # and queries, attached once to the collection.
        self.ef = embedding_functions.OllamaEmbeddingFunction(
            url="http://localhost:11434/api/embeddings",
            model_name="llama3",
        )
        self.collection = self.client.get_or_create_collection(
            name=COLLECTION_NAME,
            embedding_function=self.ef,
        )

    def add(self, chunks: list[str], doc_id: str) -> None:
        # Chroma requires a unique id per chunk; derive them from the source id.
        ids = [f"{doc_id}_{i}" for i in range(len(chunks))]
        self.collection.add(documents=chunks, ids=ids)

    def query(self, text: str, n_results: int = 5) -> list[str]:
        results = self.collection.query(
            query_texts=[text],
            n_results=n_results,
        )
        # Results are nested per query; we sent one query, so take the first list.
        return results["documents"][0]

PersistentClient writes to disk, so your embeddings survive process restarts.
Run add() once per document; query as many times as you want.
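End to end, ingesting one file and checking retrieval looks like this (the file name and query are placeholders):

db = ChromaDBManager()
chunks = load_pdf("report.pdf")          # placeholder path
db.add(chunks, doc_id="report.pdf")

for hit in db.query("What were the key findings?", n_results=3):
    print(hit[:80])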
Generating answers
With the relevant chunks retrieved, the generation step is straightforward: format a prompt, call Ollama, return the response.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(query: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain enough information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,  # local generation can be slow; avoid hanging forever
    )
    response.raise_for_status()
    return response.json()["response"]

Grounding the model in the retrieved context with an explicit instruction ("answer using only the context below") dramatically reduces hallucination compared to bare generation.
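The same endpoint can also stream. If you want tokens as they are produced instead of waiting for the full answer, the change is small; this sketch assumes Ollama's newline-delimited JSON streaming format, where each line carries a "response" fragment and the final line sets "done" to true:

import json

def generate_streaming(prompt: str) -> str:
    pieces = []
    with requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": True},
        stream=True,
        timeout=120,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            part = json.loads(line)
            pieces.append(part.get("response", ""))
            if part.get("done"):
                break
    return "".join(pieces)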
Putting it together
def ask(query: str) -> str:
    db = ChromaDBManager()
    chunks = db.query(query)
    return generate(query, chunks)

CLI interface
if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1 and sys.argv[1] == "ingest":
        # Ingest mode: index every PDF path passed on the command line.
        db = ChromaDBManager()
        for path in sys.argv[2:]:
            chunks = load_pdf(path)
            db.add(chunks, doc_id=path)
            print(f"Ingested {len(chunks)} chunks from {path}")
    else:
        # Query mode: interactive loop; an empty line exits.
        while True:
            query = input("Query: ").strip()
            if not query:
                break
            print(ask(query))

Usage:
python rag.py ingest report.pdf notes.pdf
python rag.py

FastAPI wrapper
For web or programmatic access, a two-endpoint FastAPI app covers ingest and query:
from fastapi import FastAPI, UploadFile
import shutil
import tempfile

app = FastAPI()
db = ChromaDBManager()

@app.post("/ingest")
async def ingest(file: UploadFile) -> dict:
    # Write the upload to a temporary file so PyPDFLoader can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    # The file is closed and fully flushed once the with-block exits.
    chunks = load_pdf(tmp_path)
    db.add(chunks, doc_id=file.filename or "upload")
    return {"chunks_indexed": len(chunks)}

@app.post("/query")
async def query(body: dict) -> dict:
    answer = ask(body["question"])
    return {"answer": answer}

What this is good for
Private documents you cannot send to an API: legal filings, internal specs, medical records, proprietary research. The whole stack — Ollama, ChromaDB, this code — runs on a laptop with no outbound traffic.
The main limitation is PDF quality. Dense tables, scanned images, and non-standard layouts degrade extraction quality upstream, before any embedding or retrieval happens. For production-grade pipelines, preprocessing matters as much as the RAG logic itself.
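One cheap pre-flight check is to flag pages that yield almost no extractable text, since those are usually scans that need OCR before they are worth embedding. A sketch using the same PyPDFLoader; the 50-character threshold and file name are arbitrary placeholders:

def pages_needing_ocr(path: str, min_chars: int = 50) -> list[int]:
    # Pages whose extracted text is nearly empty are probably scanned images.
    pages = PyPDFLoader(path).load()
    return [
        i for i, page in enumerate(pages)
        if len(page.page_content.strip()) < min_chars
    ]

flagged = pages_needing_ocr("report.pdf")   # placeholder path
if flagged:
    print(f"Pages likely needing OCR (0-indexed): {flagged}")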