Library — Peter Niu

A personal research library searchable by AI — and a worked example of what makes semantic search actually good.

I have a folder of research papers and textbooks I’ve collected over years of working in EdTech — cognitive theory, measurement sciences, learning sciences, statistics, UX/LX research. niu-library makes that collection searchable by AI. I ask a question in natural language, and Claude reads the passages that actually answer it before responding.

When an enterprise team asks “should we use Vectara or build on Pinecone?” the answer depends on chunking strategy, reranking quality, and hybrid search trade-offs that are invisible from a vendor demo. Building niu-library is how I learned to see them — and the mental model transfers directly to evaluating any RAG platform.

This walkthrough follows one question — “What does the literature say about worked examples in math education?” — from query to answer, through every step in between.

TL;DR

Why keyword search fails for research libraries — and what replaces it
Two-stage retrieval: why one pass of vector search isn’t enough
Parent-child chunking: search small units, read big sections
The full stack from question to answer, annotated

The problem: keyword search is the wrong tool

Remember Ctrl+F inside a PDF? You remember the idea but not the exact words. You search for “worked examples” and miss the chapter that calls them “fully worked-out solutions.” You search for “cognitive load” and get fifty hits in a reference list before you find a paragraph of actual content.

The traditional fix is full-text indexing — basically Google for your PDFs. Better than Ctrl+F, but it still searches for words. It can’t tell you that “the expertise reversal effect” and “cognitive load increases with prior knowledge” are talking about the same thing.

What you want is to search by meaning. Ask “worked examples in math,” and the system surfaces passages about Sweller, about example-problem pairs, about faded scaffolding — whether or not they use your exact phrasing.

The building blocks

Every question travels through several systems before an answer comes back. Here’s the cast:

Component	Role	Why this one
AWS Lambda + API Gateway	Serverless runtime and front door	No servers to manage; free tier covers all usage
S3	PDF storage	Upload triggers the ingestion pipeline automatically
Pinecone	Vector database	Matches by meaning, not keywords; supports hybrid dense+sparse search
Voyage AI `voyage-4`	Embedding model	Text → 1,024-number vectors representing meaning
Voyage AI `rerank-2.5`	Cross-encoder reranker	Reads each candidate with the query for precise scoring
pymupdf4llm	PDF → structured markdown	Preserves headers, lists, paragraphs — splits at natural seams
Claude Haiku	Metadata extraction	Reads first pages to extract title, author, year, domain
MCP	Integration protocol	Any AI tool connects without custom wiring

How a question becomes an answer

Here’s what happens when I ask Claude about worked examples.

[Diagram: the pipeline from question to answer. Steps with a blue stripe are AI-powered; dashed boxes are inputs and outputs.]

Embedding the question

The question goes to Voyage’s voyage-4 model, which returns a vector of 1,024 numbers — same model, same dimensions as the chunks already sitting in Pinecone.

Asymmetric embeddings. niu-library calls Voyage with input_type="query"; ingestion used input_type="document". Modern embedding models produce slightly different vectors for the same text depending on whether it’s being asked or answered. Asking and answering live in slightly different spaces.
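A minimal sketch of the two call shapes, assuming the official `voyageai` Python SDK and its `Client.embed` method. The helper names `embed_query` and `embed_documents` are hypothetical, not taken from niu-library's code; the client is passed in so the functions can be exercised without network access.

```python
from typing import Any, List

MODEL = "voyage-4"  # model name as given in the article


def embed_query(client: Any, question: str) -> List[float]:
    """Embed a search question. input_type="query" marks the asking side."""
    result = client.embed([question], model=MODEL, input_type="query")
    return result.embeddings[0]


def embed_documents(client: Any, chunks: List[str]) -> List[List[float]]:
    """Embed chunks at ingestion. input_type="document" marks the answering side."""
    result = client.embed(chunks, model=MODEL, input_type="document")
    return result.embeddings
```

With the real SDK, `client` would be `voyageai.Client()`, which reads its key from the `VOYAGE_API_KEY` environment variable. The only difference between the two helpers is the `input_type` argument, which is the whole point of the asymmetry.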

Two namespaces in parallel

The library is split into two Pinecone namespaces: papers (research articles, under 50 pages) and books (textbooks and longer texts). Each uses its own chunk size — 512 tokens for papers, 1,024 for books — because a paper’s argument unfolds tightly while a textbook chapter has more room to breathe.
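The routing rule above is small enough to write down. This is a sketch under the stated thresholds (under 50 pages is a paper; 512 versus 1,024 tokens); the `Route` type and `route_document` name are hypothetical.

```python
from dataclasses import dataclass

PAPER_MAX_PAGES = 50  # "under 50 pages" counts as a paper


@dataclass(frozen=True)
class Route:
    namespace: str      # Pinecone namespace the document lands in
    chunk_tokens: int   # child chunk size used for that namespace


def route_document(page_count: int) -> Route:
    """Pick namespace and chunk size from page count alone."""
    if page_count < PAPER_MAX_PAGES:
        return Route("papers", 512)
    return Route("books", 1024)
```

Keeping the decision in one function means the chunk size and the namespace can never drift apart for a given document.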

Both namespaces get searched in parallel, each returning its 40 best candidates. The score is a hybrid of two signals:

Dense vectors — the meaning-based match described above.
Sparse vectors — Pinecone’s pinecone-sparse-english-v0 model, closer to classical keyword matching but smarter. Catches cases where the exact phrase matters, like “Sweller (1988)” or “split-attention effect.”

Dense catches meaning. Sparse catches specificity. Hybrid scoring uses both.
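One common way to combine the two signals is a weighted sum. The article does not say how niu-library weights them, so treat this as an illustrative sketch: the `alpha` parameter, its default, and the assumption that both scores are already on a comparable scale are all mine.

```python
from typing import Dict, List


def hybrid_score(dense: float, sparse: float, alpha: float = 0.7) -> float:
    """Convex combination: alpha=1.0 is pure dense, alpha=0.0 is pure sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dense + (1.0 - alpha) * sparse


def rank_hybrid(candidates: List[Dict], alpha: float = 0.7) -> List[Dict]:
    """Sort candidates (dicts with 'dense' and 'sparse' scores) best first."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["dense"], c["sparse"], alpha),
        reverse=True,
    )
```

A passage that quotes "Sweller (1988)" verbatim can outrank a semantically closer one once the sparse score is allowed to count, which is the behavior the hybrid is there to provide.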

A noise pool of 80 candidates

At this point the system has 80 candidate passages. Some are excellent. Some are tangentially related. Some are off-topic — a software docs page that uses the phrase “worked example,” or a textbook section on math anxiety that mentions worked problems in passing.
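The pool of 80 comes from fanning one query out to both namespaces and concatenating the results. Here is a sketch using a thread pool; the `search` callable stands in for whatever actually queries Pinecone, and its signature is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Hit = Dict[str, object]
# Hypothetical signature: (namespace, query, top_k) -> list of hits
SearchFn = Callable[[str, str, int], List[Hit]]

NAMESPACES = ("papers", "books")
PER_NAMESPACE = 40


def gather_candidates(search: SearchFn, query: str) -> List[Hit]:
    """Query every namespace concurrently and merge into one candidate pool."""
    with ThreadPoolExecutor(max_workers=len(NAMESPACES)) as pool:
        futures = [
            pool.submit(search, ns, query, PER_NAMESPACE) for ns in NAMESPACES
        ]
        candidates: List[Hit] = []
        for future in futures:
            # result() blocks until that namespace's search has finished
            candidates.extend(future.result())
    return candidates
```

Threads fit here because each search is a network call: the work is waiting, not computing. No cross-namespace sorting happens at this point, since ordering the pool is the reranker's job.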

This is where naive, single-stage RAG fails: take the top 8 by cosine similarity, hand them to the LLM, and hope. The LLM cites whatever it receives, even passages with nothing useful to say about the question.

The core problem with single-stage RAG. Embedding similarity is a rough approximation of relevance, not a precise one. Two passages can land near your query for completely different reasons. That’s what the next step fixes.

Two-stage retrieval

This is the single pattern that separates RAG systems that work from those that don’t. Almost every production system in early 2026 uses it.

Why one stage isn’t enough

When you embed a query and find the nearest chunks in vector space, you’re using a bi-encoder. The query and each passage were embedded separately, ahead of time. The comparison is just a distance calculation. Fast. Cheap. Also imprecise.
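To make "just a distance calculation" concrete, here is a pure-Python sketch of bi-encoder lookup with cosine similarity. The in-memory `index` dict is a stand-in for a vector database; a real store uses approximate nearest-neighbor search rather than this linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest(
    query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top k."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Notice what the function never sees: the text. Each passage was reduced to its vector ahead of time, and the query only meets that summary.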

The imprecision is structural. A 1,024-number vector compresses an entire passage into a single point. Two passages can land near your query for different reasons — one because it answers your question, another because it shares vocabulary. The bi-encoder has no way to tell those apart. It never reads the query and the passage together.

A cross-encoder does. You hand it the query and one passage as a single input. It reads them jointly — attending to how each word in the query relates to each word in the passage — and outputs a single relevance score. Far more accurate. Far more expensive. You can’t run it across millions of chunks. But 80 candidates? Easy.
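In code, the cross-encoder step is a single call that takes the query and the candidate texts together. A sketch assuming the `voyageai` SDK's `Client.rerank` method and the `rerank-2.5` model named earlier; the wrapper `rerank_candidates` is a hypothetical helper.

```python
from typing import Any, List, Tuple


def rerank_candidates(
    client: Any, query: str, passages: List[str], top_k: int = 8
) -> List[Tuple[int, float]]:
    """Return (index into passages, relevance score) pairs, best first."""
    reranking = client.rerank(query, passages, model="rerank-2.5", top_k=top_k)
    # Each result carries the passage's original position and its score
    return [(r.index, r.relevance_score) for r in reranking.results]
```

Returning indices rather than text lets the caller map each score back to the full hit record (ids, metadata, `parent_id`) it already holds.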

Stage 1 (broad, cheap): Bi-encoder embeddings cast a wide net and return a pool of plausible matches. Optimizes for recall.
Stage 2 (narrow, precise): Cross-encoder reranker reads each candidate alongside the query. Delivers precision.
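The two stages compose into a short function. This sketch keeps both stages pluggable so either can be swapped independently; `retrieve` and `score_pair` are hypothetical callables, and the defaults of 80 and 8 mirror the numbers used in this walkthrough.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str, int], List[str]]  # (query, pool_size) -> passages
PairScorer = Callable[[str, str], float]     # (query, passage) -> relevance


def two_stage_search(
    query: str,
    retrieve: Retriever,
    score_pair: PairScorer,
    pool_size: int = 80,
    top_k: int = 8,
) -> List[Tuple[str, float]]:
    """Stage 1 gathers a wide pool; stage 2 rescores each passage with the query."""
    pool = retrieve(query, pool_size)                      # broad, cheap
    scored = [(p, score_pair(query, p)) for p in pool]     # narrow, precise
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The second stage can only reorder what the first stage found, so `pool_size` is the knob that trades cost against the chance that the right passage is in the pool at all.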

What the quality jump looks like

[Figure: candidate passages shown as boxes. Outlined boxes are actually relevant; muted ones are near-misses the embedding pulled in for surface reasons.]

The reranker doesn’t invent better passages — they were already in the pool of 80. It reorders. The math anxiety chapter got pulled in by surface similarity (“example” appears in the text); the reranker, having actually read it alongside the query, demotes it. Passages ranked twentieth or fortieth by cosine — but that genuinely answer the question — rise to the top.

The difference. Whether a RAG system hallucinates plausibly or quotes the right book often comes down to whether you rerank.

Why the first stage still matters

If reranking is better, why not rerank everything? Cost. The cross-encoder reads each passage with the query — running it against 1,500 chunks would take seconds and burn budget. The bi-encoder embeds everything once at ingestion time; at query time it’s one embedding and a fast geometric lookup.

Cheap-and-broad first, expensive-and-precise second. The first stage’s job isn’t to be right — it’s to be inclusive enough that the right answer is in its pool.

Why chunks have parents

A search hit is a small chunk — 512 or 1,024 tokens, maybe half a page. Enough to score well on relevance, but often not enough to answer. The paragraph saying “worked examples reduce extraneous load” is the right find. The two paragraphs around it — the experimental setup, the implications — are what you actually need to read.

niu-library solves this with a parent-child hierarchy. Each PDF splits into parent chunks (2–4 KB sections, header-aligned) and child chunks (smaller units inside them). Only the children get embedded and searched. Each child carries a parent_id pointing back to its surrounding section.
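A sketch of that hierarchy as a data structure. It splits markdown at header lines to form parents, then cuts each parent into fixed-size children. Whitespace-separated words stand in for tokens here, and the id scheme is hypothetical; a real pipeline would count tokens with the embedding model's tokenizer and cap parent size.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Parent:
    parent_id: str
    text: str


@dataclass
class Child:
    child_id: str
    parent_id: str  # pointer back to the surrounding section
    text: str


def split_parents(markdown: str, doc_id: str) -> List[Parent]:
    """Split just before each markdown header so parents stay header-aligned."""
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    kept = [s.strip() for s in sections if s.strip()]
    return [Parent(f"{doc_id}-p{i}", text) for i, text in enumerate(kept)]


def split_children(parent: Parent, max_words: int) -> List[Child]:
    """Cut one parent into consecutive pieces of at most max_words words."""
    words = parent.text.split()
    children = []
    for n, start in enumerate(range(0, len(words), max_words)):
        piece = " ".join(words[start:start + max_words])
        children.append(Child(f"{parent.parent_id}-c{n}", parent.parent_id, piece))
    return children


def chunk_document(
    markdown: str, doc_id: str, max_words: int
) -> Tuple[List[Parent], List[Child]]:
    """Return (parents, children); only the children would be embedded."""
    parents = split_parents(markdown, doc_id)
    children = [c for p in parents for c in split_children(p, max_words)]
    return parents, children
```

Because children never cross a parent boundary, every search hit resolves to exactly one section to expand.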

Search small, read big. The search index holds small, focused units — that’s what gives embeddings their precision. But the LLM reads the larger surrounding section — that’s what makes answers trustworthy. Parent-child chunking gives you both.

In practice, search_docs returns 8 high-precision hits with parent_id references. A follow-up retrieve_parent call expands any of them into the full surrounding section. Claude picks which ones to expand based on what it needs.
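The expansion step can be sketched as a lookup keyed by `parent_id`. The tool names `search_docs` and `retrieve_parent` come from the text above; the in-memory `parents` dict and the `expand_hits` helper are hypothetical stand-ins for wherever niu-library actually stores parent sections.

```python
from typing import Dict, Iterable, List


def retrieve_parent(parents: Dict[str, str], parent_id: str) -> str:
    """Return the full section a child chunk came from."""
    return parents[parent_id]


def expand_hits(
    hits: List[dict], parents: Dict[str, str], wanted: Iterable[str]
) -> List[str]:
    """Expand the chosen hits into parent sections, fetching each parent once."""
    wanted_ids = set(wanted)
    seen = set()
    sections = []
    for hit in hits:
        parent_id = hit["parent_id"]
        if hit["id"] in wanted_ids and parent_id not in seen:
            seen.add(parent_id)  # two hits in one section share a parent
            sections.append(retrieve_parent(parents, parent_id))
    return sections
```

Deduplicating by parent matters in practice: when two of the eight hits sit in the same section, the model should read that section once, not twice.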

What’s inside the corpus

The library currently holds 44 documents and roughly 1,500 vectors — papers and textbooks across six domains (AI/ML, learning design, measurement, education, statistics, other). Domain tags come from Haiku classification at ingestion time, with an S3 content fallback for documents whose metadata is too thin. You can filter searches by domain to scope your question.
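Domain scoping maps naturally onto Pinecone's metadata filter syntax, which uses MongoDB-style operators such as `$in`. A sketch, assuming the tag is stored under a metadata field called `domain` and that the stored values match the labels listed above (both are assumptions):

```python
from typing import Dict, Optional, Sequence

# Labels as listed in the article; the exact stored strings are assumed.
DOMAINS = ("AI/ML", "learning design", "measurement", "education", "statistics", "other")


def domain_filter(domains: Optional[Sequence[str]]) -> Optional[Dict[str, dict]]:
    """Build a metadata filter dict, or None to search every domain."""
    if not domains:
        return None
    unknown = [d for d in domains if d not in DOMAINS]
    if unknown:
        raise ValueError(f"unknown domain(s): {unknown}")
    return {"domain": {"$in": list(domains)}}
```

The returned dict would be passed as the `filter` argument of a Pinecone query. Validating against the known labels up front turns a typo into an error instead of a silently empty result set.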

The ingestion pipeline runs automatically when a PDF lands in S3:

Classify — paper or book, by page count
Extract — structured markdown via pymupdf4llm; strip references and indexes that would poison search
Enrich — metadata via Haiku + Semantic Scholar lookups
Embed — dense (Voyage voyage-4) + sparse (Pinecone’s model)
Upload — to the matching namespace
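Two pieces of the steps above are easy to show in isolation: reading the uploaded object out of the S3 event that triggers the Lambda, and cutting the reference list before chunking. This is a sketch, not niu-library's code. The event shape is the standard S3 notification format; the function names are hypothetical, and the reference-stripping heuristic is a deliberately crude assumption (it drops everything after the heading, appendices included).

```python
import re
from typing import List, Tuple
from urllib.parse import unquote_plus


def uploaded_objects(event: dict) -> List[Tuple[str, str]]:
    """Pull (bucket, key) pairs out of an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # S3 URL-encodes keys in event payloads (a space arrives as "+")
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


_REFERENCES_HEADING = re.compile(r"(?im)^#{1,6}\s*(references|bibliography)\s*$")


def strip_references(markdown: str) -> str:
    """Drop everything from the first References/Bibliography heading onward."""
    match = _REFERENCES_HEADING.search(markdown)
    if match is None:
        return markdown
    return markdown[:match.start()].rstrip()
```

Stripping happens before embedding for the reason given earlier: a reference list is dense with exactly the keywords people search for and contains none of the content.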

Four Lambda functions handle the whole flow — MCP server, OAuth, authorizer, vectorizer — all deployed by SAM.

The stack, annotated

Layer	Service	What it does	Free tier?
Runtime	AWS Lambda + API Gateway	Runs code on demand, no servers to manage	Yes (1M requests/mo)
Storage	S3	Holds source PDFs; triggers vectorize on upload	Yes (5 GB)
Interface	MCP	One protocol any AI tool can plug into	Open standard
Auth	Lambda Authorizer + DynamoDB	JWT (OAuth 2.1) or x-api-key — resolves identity	Included
Extraction	pymupdf4llm	PDF → structured markdown, preserves headers	Open source
Metadata	Claude Haiku	Reads first pages, returns title/author/year/domain	Pay-per-use
Dense embedding	Voyage `voyage-4`	Text → 1,024 numbers representing meaning	Free tier available
Sparse embedding	Pinecone `pinecone-sparse-english-v0`	Smart keyword signal for hybrid search	Included with Pinecone
Vector store	Pinecone	Stores and searches by meaning	Yes (100K vectors)
Reranking	Voyage `rerank-2.5`	Cross-encoder relevance scoring on top candidates	Free tier available
Deploy	AWS SAM	One command to package and ship all four Lambdas	Free (tooling)

Why build this yourself?

You can pay for a hosted RAG product. Some are good. But the seams are hidden — chunking strategy, embedding choice, hybrid scoring, two-stage retrieval, parent expansion — all wrapped in a box that returns “the answer.”

Build it yourself and the seams are visible. Swap the reranker, change the chunk size, try a different embedding model, and watch the quality change. You learn what good even means.

That’s the ethos behind every project on this site: build the real thing, then show how every layer works so anyone can follow along.