Peter Niu

Document Search

A local-first document search tool for AI — and a worked example of when to keep everything on your own machine.

active
Updated May 20, 2026

doc-mcp is what niu-library, its cloud-hosted sibling, looks like when nothing leaves your laptop. It does the same job (making a folder of personal documents searchable by an AI tool) but is built on the opposite set of choices. No cloud, no API keys, no monthly bill, and no request that leaves your machine at search time. A vector database that lives in a single SQLite file on your disk. An embedding model that runs on your own CPU or GPU. An MCP server that talks to Claude through a pipe, not a port.

“Local-first or cloud?” is a question that comes up in every enterprise AI evaluation. Building both sides — doc-mcp for local, niu-library for cloud — is the fastest way to develop a real feel for that trade-off instead of guessing from spec sheets.

This walkthrough does two things. It traces one PDF — a paper you dropped onto your desktop this morning — from your filesystem into a local vector store and back out as a search hit Claude reads. And it puts doc-mcp next to niu-library to make the trade-offs concrete: when does local-first actually win, and when do you need the cloud version anyway?

TL;DR

  • How a local vector search pipeline works end-to-end: text extraction, chunking, local embedding with Ollama, and retrieval from ChromaDB, all without a request leaving your machine
  • Why stdio transport is the right MCP shape for local tools: no port, no auth, no certificate — just a process pipe
  • When local-first beats cloud-first (privacy, offline, zero cost) and when it doesn’t (search quality, structure-aware chunking, scale)
  • How doc-mcp and niu-library answer the same question in opposite directions — reading them together is the fastest way to develop a feel for this trade-off

The problem: not every document belongs in the cloud

niu-library is the cloud version of this idea. It lives on AWS, embeds with Voyage, stores vectors in Pinecone, and uses a reranker that reads each candidate alongside the query. It’s the better search engine. It’s also a system where every document I ingest leaves my machine and every question I ask becomes an API call.

For my published research collection that’s fine — those PDFs are already public, and the search quality matters. For other things it isn’t: a draft I’m not ready to share, client material under NDA, personal notes. Anything where “send this to three vendors” is the wrong default.

The cloud-first instinct says make a private bucket and trust the access controls. The local-first instinct says don’t send it anywhere. Both are defensible. doc-mcp is the second instinct made operational: the document never touches a network, the embeddings live on your disk, and the search runs on your own machine. Unplug the wifi and nothing changes.

The question this walkthrough answers is how — what does a local-first AI search tool look like under the hood, and what do you give up compared to the cloud version?

The building blocks

Three core components and two supporting pieces, all running on your laptop.

| Component | Role | Why this one |
| --- | --- | --- |
| Ollama | Runs mxbai-embed-large locally over localhost:11434: text in, 1,024-number vector out | No account, no key, no metering; same job Voyage does in the cloud, done on your own silicon |
| ChromaDB | Vector database that persists to a single SQLite file on disk | Zero infrastructure (no server, no port, no cluster); the simplest vector store to understand before reaching for Pinecone |
| MCP over stdio | JSON-RPC over process stdin/stdout; Claude Desktop spawns the server as a child process | No network, no auth, no port; server lifetime is bounded by the AI tool that launched it |
| pypdf / python-docx / EbookLib | Convert PDF, DOCX, TXT, and EPUB to plain text | Covers the common personal-document formats without pulling in heavy dependencies |
| ingested_files.txt | Tracks which documents have already been embedded | Re-run the ingest script anytime; it skips files already in the list, so you never pay the embedding cost twice |

ChromaDB is the right first vector store to learn. Its entire persistence layer is a SQLite file you can open in any SQLite browser. No server, no cluster, no port to manage. Once you understand the embed-store-query loop here, Pinecone adds the operational complexity on top of a mental model you already own, not the other way around.

How a PDF becomes a searchable answer

Picture a file on your desktop: expertise-reversal-2003.pdf. Drop it into doc-mcp’s documents/ folder and run the ingest script. From that moment to the point where Claude can quote a passage back to you, here’s what happens.

Diagram: the ingest and query pipeline.

expertise-reversal-2003.pdf (a file in documents/ on your laptop)
  → Extract text: pypdf pulls out the text layer of every page
  → Chunk into passages: ~1,000 characters each, with overlap so ideas span seams
  → Embed each chunk locally (Ollama, local): Ollama runs mxbai-embed-large on your CPU/GPU; each chunk becomes a vector of 1,024 numbers
  → Persist to ChromaDB: vectors + chunk text + source filename → a SQLite file on disk; filename logged to ingested_files.txt
  → Later, at query time: "what does expertise reversal mean?" → embed query (same model), ChromaDB returns top 5

Each step deserves a look.

Step 1 — text out of the PDF

A PDF is a layout format, not a text format — under the hood it’s a description of where to draw glyphs on a page. pypdf walks the document’s text layer and pulls out characters in roughly the right reading order. For a clean, modern PDF — the kind academic publishers produce — that’s plenty.

For a scanned PDF there is no text layer. pypdf returns empty strings, the chunker has nothing to chunk, and the document silently fails to ingest. The planned fix is to detect missing text layers and run ocrmypdf (a Tesseract wrapper) automatically. Until that lands, scanned documents need to be OCR’d manually first.
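As a rough illustration of this step, here is a minimal sketch of page-by-page extraction plus a check for a missing text layer. It assumes pypdf's PdfReader and extract_text(); the helper names (extract_pages, has_text_layer) and the 20-characters-per-page threshold are hypothetical choices for this example, not doc-mcp's actual code.

```python
from pathlib import Path


def extract_pages(pdf_path: Path) -> list[str]:
    """Return the extracted text of each page ("" when a page has none)."""
    # Imported lazily so the heuristic below is usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(pages: list[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: a PDF whose pages are nearly empty on average is probably a scan.

    The threshold is an assumption for illustration; tune it on real documents.
    """
    if not pages:
        return False
    total_chars = sum(len(page.strip()) for page in pages)
    return total_chars / len(pages) >= min_chars_per_page
```

A check like has_text_layer(extract_pages(path)) before chunking would turn the silent failure into an explicit warning that the file needs OCR first.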

Step 2 — chunking

The extracted text is split into passages of roughly a thousand characters each, with a small overlap so that an idea straddling a boundary still appears whole in at least one chunk.
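The idea can be sketched as a fixed-size character chunker with overlap. The 1,000-character size comes from the description above; the 100-character overlap and the function name chunk_text are assumptions made for this example.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():  # skip windows that are only whitespace
            chunks.append(chunk)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk begins with the last `overlap` characters of the one before it, so a sentence cut at one seam is intact in the neighbouring chunk.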

This chunker isn’t structure-aware: it treats the document as a single text stream and cuts on character counts. niu-library’s pymupdf4llm pipeline, by contrast, preserves headers and section structure. That’s one of the most concrete differences between the two. Character-count chunking is good enough for most personal documents, but the gap shows on textbooks with deep hierarchical structure.

Step 3 — embedding, locally

Each chunk goes through Ollama running mxbai-embed-large — an open-weights model, about 670 MB, downloaded once and run forever. Ollama loads it into memory the first time you call it; subsequent calls hit the same process. Each chunk becomes a vector of 1,024 floating-point numbers, matching the dimensionality of Voyage’s voyage-4 in the cloud version. The numbers are different — different model, different training — but the shape is the same: text in, vector out, similar text means nearby vectors.
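A minimal sketch of the embedding call, assuming Ollama's HTTP endpoint /api/embed on the localhost:11434 address mentioned earlier. Whether doc-mcp talks to this endpoint directly or goes through a client library is not stated here, so post_json, embed, and the injectable post parameter are illustrative names only.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "mxbai-embed-large"


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def embed(text: str, post: Callable[[str, dict], dict] = post_json) -> list[float]:
    """Embed one string; `post` is injectable so the function can be tested offline."""
    body = post(OLLAMA_URL, {"model": MODEL, "input": text})
    # /api/embed returns one vector per input under the "embeddings" key.
    return body["embeddings"][0]
```

The request goes to the loopback interface, so it never leaves the machine. The same function serves ingestion and querying, which keeps both sides in the same vector space.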

What an embedding actually is. The model has learned that “expertise reversal” and “the case where experts learn worse from worked examples than novices do” should land near each other in a 1,024-dimensional space. The individual numbers aren’t meaningful. The direction the vector points encodes the meaning. Search becomes geometry: which stored vector points most similarly to the query vector?
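To make "search becomes geometry" concrete, here is a small brute-force sketch of cosine similarity and top-k ranking. It is for illustration only: Chroma uses an approximate nearest-neighbour index rather than scoring every stored vector, and the names cosine_similarity and top_k are made up for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 5):
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Only direction matters: scaling a vector by a positive constant leaves its cosine similarity to any other vector unchanged.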

Step 4 — persistence

Chroma writes each vector, its chunk text, and the source filename to a SQLite database under chroma_db/. Gitignored, lives only on your machine, survives restarts. Delete the directory and the index is gone. Back it up and you’ve backed up your search index. There is no server. There is a file.
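The write path might look roughly like the sketch below. It assumes chromadb's PersistentClient, get_or_create_collection, and collection.add; the collection name, the id scheme, and the helper names build_records and persist are hypothetical, and the "hnsw:space" metadata key for cosine distance should be checked against the chromadb version you have installed.

```python
def build_records(filename: str, chunks: list[str]) -> dict:
    """Build ids, documents, and metadatas for one file's chunks."""
    indices = range(len(chunks))
    return {
        "ids": [f"{filename}::{i}" for i in indices],  # hypothetical id scheme
        "documents": chunks,
        "metadatas": [{"source": filename, "chunk": i} for i in indices],
    }


def persist(filename: str, chunks: list[str], vectors: list[list[float]],
            db_path: str = "chroma_db", collection_name: str = "documents") -> None:
    """Write vectors, chunk text, and source filename to the on-disk store."""
    import chromadb  # imported lazily; requires `pip install chromadb`

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # assumption: cosine distance
    )
    collection.add(embeddings=vectors, **build_records(filename, chunks))
```

Storing the filename in each chunk's metadata is what lets a search hit report which document it came from.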

The filename also gets appended to ingested_files.txt. Next time you run the ingest script, anything already in that list gets skipped — drop new PDFs into documents/ and re-run safely without paying the embedding cost twice.
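The skip logic is simple enough to sketch with the standard library alone. This assumes ingested_files.txt holds one filename per line; the function names are illustrative, not taken from the project.

```python
from pathlib import Path


def load_ingested(log_path: Path) -> set[str]:
    """Read the set of filenames that have already been embedded."""
    if not log_path.exists():
        return set()
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_files(documents_dir: Path, log_path: Path) -> list[Path]:
    """List files in documents/ that are not yet in the log."""
    done = load_ingested(log_path)
    return sorted(
        path for path in documents_dir.iterdir()
        if path.is_file() and path.name not in done
    )


def mark_ingested(log_path: Path, filename: str) -> None:
    """Append a filename once its chunks are safely stored."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(filename + "\n")
```

Calling mark_ingested only after the vectors are persisted means a file that fails halfway through is retried on the next run.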

Step 5 — searching

When Claude calls search_docs("what does expertise reversal mean?"), the server runs the query-side half of the same pipeline: embed the query with the same Ollama model that did ingestion (mismatched models produce vectors in incompatible spaces), hand the vector to Chroma’s query(), get the top five hits back with their text and source filenames, format them into MCP’s response shape, and write the result to stdout. Claude reads them.
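A sketch of that query path, under a few assumptions: the collection object follows chromadb's query(query_embeddings=..., n_results=...) call, which returns lists of lists keyed by "documents", "metadatas", and "distances"; each metadata dict carries a "source" filename; and the embedding function is passed in. The signature and the output format are hypothetical.

```python
from typing import Callable


def format_hits(result: dict) -> str:
    """Flatten a Chroma query result for a single query into readable text."""
    documents = result["documents"][0]  # index 0: results for the first query
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]
    blocks = []
    for rank, (doc, meta, dist) in enumerate(zip(documents, metadatas, distances), start=1):
        blocks.append(f"[{rank}] {meta['source']} (distance {dist:.3f})\n{doc}")
    return "\n\n".join(blocks)


def search_docs(query: str, embed: Callable[[str], list[float]],
                collection, k: int = 5) -> str:
    """Embed the query with the ingestion model, then ask for the k nearest chunks."""
    result = collection.query(query_embeddings=[embed(query)], n_results=k)
    return format_hits(result)
```

Passing the embedding function in explicitly makes it hard to query with a different model than the one used at ingest time.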

Nothing has left your machine. The only request in the round trip is a loopback call to Ollama on localhost:11434; everything else (question to embedded vector to nearest chunks to formatted response) is your CPU talking to your disk.

A side-by-side: doc-mcp vs. niu-library

Same problem, opposite solution. Both projects exist on this site because the comparison is the lesson.

| | doc-mcp (local-first) | niu-library (cloud-first) |
| --- | --- | --- |
| Runtime | Python process on your laptop | AWS Lambda |
| Embeddings | Ollama · mxbai-embed-large | Voyage AI · voyage-4 |
| Vector store | ChromaDB · SQLite file | Pinecone · hybrid index |
| Retrieval | One stage · cosine top-k | Two stage · hybrid + rerank |
| Transport | MCP over stdio · no network | MCP over HTTPS · OAuth/JWT |
| Cost | $0 (electricity only) | ~$1-2/mo at this scale |

The trade-offs cluster into three categories.

What you get from local-first.

  • Privacy by construction — the document has no path off your machine
  • Zero marginal cost — every additional document and query after the first download is free
  • Works offline — on a plane, on an untrusted network, when the wifi is broken
  • Small threat model — no API key to rotate, no IAM role to misconfigure

What you give up.

Search quality is the big one. niu-library runs hybrid dense-plus-sparse retrieval followed by a cross-encoder reranker that reads each candidate alongside the query. That’s the difference between “this passage shares vocabulary with the question” and “this passage actually answers the question.” doc-mcp does one-stage cosine retrieval — small difference for most personal documents, visible for a large research corpus with subtle queries.

You also give up structure-aware chunking and metadata enrichment: Haiku reading the first few pages to tag a document by domain, Semantic Scholar lookups that fill in author and year. Local-side, you have a filename. That’s it.

What you give up that you might not notice. No multi-user access. No automatic backups — disk failure means your embeddings go with it. No query monitoring. None of these matter for a solo laptop tool; all of them matter the moment someone else uses the system.

The local-first decision rule. Would you upload these documents to a cloud bucket today? If yes, the cloud version is worth it for better search quality. If no, local is the right shape. Most people have documents on both sides of that line, which is why both projects exist.

The simplest MCP server pattern

doc-mcp is a useful object lesson in how minimal an MCP server can be.

outlook-mcp and niu-library run as long-lived HTTPS servers behind OAuth, deployed to Railway or AWS, accepting connections from many clients. That’s the right shape for shared services.

doc-mcp runs as a stdio server. No port. No listening socket. You configure Claude Desktop with a command (python -m doc_mcp.server) and a working directory. Claude Desktop spawns that command as a child process, talks to it over stdin/stdout using newline-delimited JSON-RPC, and shuts it down when it is done with it, so the server never outlives the client that launched it.

stdio transport: no auth by design. No network means no auth — if you can run the process on the machine, you can use the tool. No port means no firewall conversation, no certificate, no domain. The threat surface is whatever your laptop’s threat surface already is. For a local tool that searches local documents, that’s exactly the right level. If you ever wondered what the minimum-viable MCP server looks like, doc-mcp is close: two tools, stdio transport, no auth, one Python entrypoint file.
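To show how little machinery the stdio shape needs, here is a deliberately simplified sketch of a newline-delimited JSON-RPC loop. It is not a complete MCP implementation: it skips the initialization handshake and handles only tools/list and tools/call. It is also not doc-mcp's actual server, which would more likely use an MCP SDK. Only search_docs is listed because the second tool is not named in this write-up; the tool description and schema are assumptions.

```python
import json
from typing import Callable, Optional, TextIO

SEARCH_TOOL = {
    "name": "search_docs",
    "description": "Search the local document index.",  # assumed wording
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def handle(message: dict, search: Callable[[str], str]) -> Optional[dict]:
    """Turn one JSON-RPC request into one response (None for notifications)."""
    if "id" not in message:
        return None  # notifications carry no id and get no reply
    reply = {"jsonrpc": "2.0", "id": message["id"]}
    method = message.get("method")
    params = message.get("params", {})
    if method == "tools/list":
        reply["result"] = {"tools": [SEARCH_TOOL]}
    elif method == "tools/call" and params.get("name") == "search_docs":
        text = search(params["arguments"]["query"])
        reply["result"] = {"content": [{"type": "text", "text": text}]}
    else:
        reply["error"] = {"code": -32601, "message": "Method not found"}
    return reply


def serve(stdin: TextIO, stdout: TextIO, search: Callable[[str], str]) -> None:
    """Read one JSON message per line, write one JSON reply per line."""
    for line in stdin:
        if not line.strip():
            continue
        reply = handle(json.loads(line), search)
        if reply is not None:
            stdout.write(json.dumps(reply) + "\n")
            stdout.flush()  # the client is waiting on the pipe


# Typical wiring: serve(sys.stdin, sys.stdout, my_search_function)
```

There is no socket, no TLS, and no token check anywhere in the loop; access control is whatever governs who can start the process.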

The stack, annotated

| Layer | Tool | What it does | Where it runs |
| --- | --- | --- | --- |
| Runtime | Python | Hosts the MCP server process | Your laptop |
| Transport | MCP over stdio | JSON-RPC over stdin/stdout, no network | Process pipes |
| Embeddings | Ollama + mxbai-embed-large | Text → 1,024-number vector, 670 MB model | Your CPU/GPU |
| Vector store | ChromaDB | Stores and searches vectors, persists to SQLite | A file on disk |
| File parsing | pypdf, python-docx, EbookLib | PDF, DOCX, TXT, EPUB → plain text | Your laptop |
| Change tracking | ingested_files.txt | Skip already-embedded documents on re-run | A text file on disk |
| Cost | $0 | No API calls, no cloud bills | |
| Auth | None needed | If you can run the process, you can use the tool | |

Eight rows. None of them touch the internet at query time.

Known rough edges

Two issues show up in real use.

Scanned PDFs. If a document has no text layer — older academic scans, photographs of pages, anything from a copy machine — pypdf returns nothing and the document silently fails to ingest. The fix is ocrmypdf, which wraps Tesseract and adds a text layer. The planned upgrade detects missing text layers automatically; until it lands, scanned documents need manual OCR before being dropped into documents/.
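Until the automatic path exists, the manual OCR step can be scripted. This sketch assumes the ocrmypdf command-line tool is installed and uses its --skip-text option, which leaves pages that already contain text untouched. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(src: Path, dst: Path) -> list[str]:
    """Build the ocrmypdf invocation that writes a text-layered copy to dst."""
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def add_text_layer(src: Path, dst: Path) -> None:
    """Run OCR on a scanned PDF; raises if ocrmypdf is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(ocr_command(src, dst), check=True)
```

Writing the OCR'd copy into documents/ and keeping the original scan elsewhere avoids ingesting the same paper twice.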

Malformed PDF text layers. A small number of PDFs have technically present but mangled text — broken Unicode, embedded fonts that decode to garbage, characters the BERT-style tokenizer chokes on. The current behavior is a stack trace and a skipped document. If a document fails to ingest, first check whether its text extracts cleanly to a .txt file outside the pipeline. Better error handling and a stricter text cleaner fallback are on the list.
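One possible shape for a stricter cleaner fallback, sketched with the standard library. Everything here is an assumption for illustration: the choice of NFKC normalization, the set of character categories that get dropped, and the 20% threshold in looks_garbled.

```python
import unicodedata

# Unicode categories to drop: control, format, surrogate, private use, unassigned.
DROPPED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}


def clean_text(raw: str) -> str:
    """Normalize text and strip characters that tend to break downstream steps."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. folds ligatures into letters
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(ch)  # keep ordinary whitespace structure
        elif ch == "\ufffd":
            continue  # replacement character left by a failed decode
        elif unicodedata.category(ch) in DROPPED_CATEGORIES:
            continue
        else:
            kept.append(ch)
    return "".join(kept)


def looks_garbled(raw: str, max_dropped_ratio: float = 0.2) -> bool:
    """Flag text where cleaning removed a large share of the characters."""
    if not raw:
        return False
    dropped_ratio = 1 - len(clean_text(raw)) / len(raw)
    return dropped_ratio > max_dropped_ratio
```

A document that trips looks_garbled could be logged and skipped with a clear message instead of a stack trace.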

Both are reminders of a broader point: local-first means you own the rough edges. There is no cloud team running OCR for you. Avoiding the cloud means taking on the corner cases yourself.

Why build this yourself?

You could use a hosted note-search product. You could also Ctrl+F through a folder of PDFs like it’s 2005.

The reason to build doc-mcp is that it’s the cheapest way to understand how local AI tooling actually works. Every layer is readable: a few hundred lines of Python, a model file on your disk, a SQLite database you can open in any SQLite browser, an MCP server you can ps and kill. No hosted service obscures the seams. When something doesn’t work, the answer is in code you wrote or libraries small enough to read.

It’s also the most honest version of the privacy claim. Most “private AI” products mean “we promise we don’t look at your data.” doc-mcp means “the data has no path off your machine.” Those are not the same claim. The first depends on a vendor’s policy. The second depends on physics.

Next to niu-library, it’s a teaching object. Two projects, same question, opposite answers. Reading them together is the fastest way to develop a feel for the local-first / cloud-first trade-off — one of the load-bearing architectural decisions for any AI-native tool you’re likely to build next.

Build the minimum-viable version yourself, in code small enough to fit in your head. Then you know what the layers do, what the trade-offs cost, and which version you actually want.