Peter Niu

Library

A personal research library searchable by AI — and a worked example of what makes semantic search actually good.

active
Updated May 20, 2026

I have a folder of research papers and textbooks I’ve collected over years of working in EdTech — cognitive theory, measurement sciences, learning sciences, statistics, UX/LX research. niu-library makes that collection searchable by AI. I ask a question in natural language, and Claude reads the passages that actually answer it before responding.

When an enterprise team asks “should we use Vectara or build on Pinecone?” the answer depends on chunking strategy, reranking quality, and hybrid search trade-offs that are invisible from a vendor demo. Building niu-library is how I learned to see them — and the mental model transfers directly to evaluating any RAG platform.

This walkthrough follows one question — “What does the literature say about worked examples in math education?” — from query to answer, through every step in between.

TL;DR

  • Why keyword search fails for research libraries — and what replaces it
  • Two-stage retrieval: why one pass of vector search isn’t enough
  • Parent-child chunking: search small units, read big sections
  • The full stack from question to answer, annotated

The problem: keyword search is the wrong tool

Think of Ctrl+F inside a PDF. You remember the idea but not the exact words. You search for “worked examples” and miss the chapter that calls them “fully worked-out solutions.” You search for “cognitive load” and get fifty hits in a reference list before you find a paragraph of actual content.

The traditional fix is full-text indexing — basically Google for your PDFs. Better than Ctrl+F, but it still searches for words. It can’t tell you that “the expertise reversal effect” and “cognitive load increases with prior knowledge” are talking about the same thing.

What you want is to search by meaning. Ask “worked examples in math,” and the system surfaces passages about Sweller, about example-problem pairs, about faded scaffolding — whether or not they use your exact phrasing.
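A tiny illustration of that failure mode, using two made-up passages. The helper name and the sample strings are illustrative only and are not taken from the library:

```python
def keyword_hits(query: str, passages: list[str]) -> list[str]:
    """Return passages that contain the query as a literal, case-insensitive substring."""
    needle = query.lower()
    return [p for p in passages if needle in p.lower()]


passages = [
    "Students who studied fully worked-out solutions outperformed the control group.",
    "Sweller, J. (1988). Cognitive load during problem solving. Cognitive Science.",
]

# The relevant passage is phrased differently, so a literal match finds nothing.
print(keyword_hits("worked examples", passages))
# The reference-list entry matches even though it explains nothing.
print(keyword_hits("cognitive load", passages))
```

The first search returns an empty list; the second returns only the bibliography line. Both results are technically correct and neither is useful.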

The building blocks

Every question travels through several systems before an answer comes back. Here’s the cast:

| Component | Role | Why this one |
| --- | --- | --- |
| AWS Lambda + API Gateway | Serverless runtime and front door | No servers to manage; free tier covers all usage |
| S3 | PDF storage | Upload triggers the ingestion pipeline automatically |
| Pinecone | Vector database | Matches by meaning, not keywords; supports hybrid dense+sparse search |
| Voyage AI voyage-4 | Embedding model | Text → 1,024-number vectors representing meaning |
| Voyage AI rerank-2.5 | Cross-encoder reranker | Reads each candidate with the query for precise scoring |
| pymupdf4llm | PDF → structured markdown | Preserves headers, lists, paragraphs; splits at natural seams |
| Claude Haiku | Metadata extraction | Reads first pages to extract title, author, year, domain |
| MCP | Integration protocol | Any AI tool connects without custom wiring |

How a question becomes an answer

Here’s what happens when I ask Claude about worked examples.

  1. Input: the question, “worked examples in math education”
  2. Embed the query (Voyage AI): question → 1,024 numbers representing its meaning
  3. Search papers and search books, in parallel (Pinecone): hybrid dense + sparse, top 40 candidates from each
  4. Pool 80 candidates: a wide net, some great, some mediocre, some off-topic
  5. Rerank (Voyage rerank): the cross-encoder reads each passage with the query and keeps 8
  6. Output: top 8 passages, each with a parent_id for full-section expansion
  7. Optional: retrieve_parent expands a hit into its full section

Embedding the question

The question goes to Voyage’s voyage-4 model, which returns a vector of 1,024 numbers — same model, same dimensions as the chunks already sitting in Pinecone.

Asymmetric embeddings. niu-library calls Voyage with input_type="query"; ingestion used input_type="document". The flag tells the model which role the text plays, so the same text gets a slightly different vector as a question than as a passage. Both still land in the same vector space, which is what lets a query be compared against stored chunks; the role hint is meant to place questions nearer the passages that answer them.
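A minimal sketch of the two call sites, assuming a client object shaped like voyageai.Client (an embed method that accepts texts, model, and input_type and returns an object with an embeddings list). The wrapper names embed_query and embed_documents are hypothetical, and the client is passed in so the sketch runs without the SDK or an API key:

```python
from typing import Protocol, Sequence


class EmbeddingClient(Protocol):
    """The slice of the client this sketch relies on (modeled on voyageai.Client.embed)."""

    def embed(self, texts: list[str], model: str, input_type: str): ...


def embed_query(client: EmbeddingClient, question: str,
                model: str = "voyage-4") -> Sequence[float]:
    """Embed a search question; the role flag marks it as a query."""
    result = client.embed([question], model=model, input_type="query")
    return result.embeddings[0]


def embed_documents(client: EmbeddingClient, chunks: list[str],
                    model: str = "voyage-4") -> list[Sequence[float]]:
    """Embed chunks at ingestion time; same model, opposite role flag."""
    result = client.embed(chunks, model=model, input_type="document")
    return list(result.embeddings)
```

The only difference between the two functions is the input_type value. Mixing them up does not raise an error; it just quietly costs retrieval quality, which is why it is worth isolating in two named helpers.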

Two namespaces in parallel

The library is split into two Pinecone namespaces: papers (research articles, under 50 pages) and books (textbooks and longer texts). Each uses its own chunk size — 512 tokens for papers, 1,024 for books — because a paper’s argument unfolds tightly while a textbook chapter has more room to breathe.

Both namespaces get searched in parallel, each returning its 40 best candidates. The score is a hybrid of two signals:

  • Dense vectors — the meaning-based match described above.
  • Sparse vectors — Pinecone’s pinecone-sparse-english-v0 model, closer to classical keyword matching but smarter. Catches cases where the exact phrase matters, like “Sweller (1988)” or “split-attention effect.”

Dense catches meaning. Sparse catches specificity. Hybrid scoring uses both.
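One common way to combine the two signals is a weighted sum. The sketch below assumes that scheme; the alpha weight, the function names, and the toy vectors are all illustrative, and the text above does not say which fusion rule niu-library or Pinecone actually applies:

```python
def dot_dense(a: list[float], b: list[float]) -> float:
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def dot_sparse(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {token_id: weight}."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha: float = 0.5) -> float:
    """Assumed fusion rule: alpha * dense match + (1 - alpha) * sparse match."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dot_dense(q_dense, d_dense) + (1 - alpha) * dot_sparse(q_sparse, d_sparse)


# Toy data: passage A is slightly closer in meaning, passage B shares an exact term.
q_dense, q_sparse = [1.0, 0.0], {101: 1.0}       # token 101 stands in for "Sweller"
a_dense, a_sparse = [0.8, 0.6], {}
b_dense, b_sparse = [0.6, 0.8], {101: 0.9}

score_a = hybrid_score(q_dense, q_sparse, a_dense, a_sparse)
score_b = hybrid_score(q_dense, q_sparse, b_dense, b_sparse)
print(score_a < score_b)  # the exact-term match lifts B above A
```

With alpha set to 1.0 the sparse signal is ignored and passage A wins again, which is a quick way to see what each signal contributes.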

A noisy pool of 80 candidates

At this point the system has 80 candidate passages. Some are excellent. Some are tangentially related. Some are off-topic — a software docs page that uses the phrase “worked example,” or a textbook section on math anxiety that mentions worked problems in passing.
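The fan-out that produces this pool can be sketched with a thread pool. The search callable is a stand-in for whatever queries one Pinecone namespace; its signature, the hit dictionaries, and the function name pooled_candidates are assumptions made for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical signature: (namespace, query, top_k) -> list of hit dicts.
SearchFn = Callable[[str, str, int], list[dict]]


def pooled_candidates(search: SearchFn, query: str,
                      namespaces: tuple[str, ...] = ("papers", "books"),
                      per_namespace: int = 40) -> list[dict]:
    """Query every namespace concurrently and merge the hits into one pool."""
    with ThreadPoolExecutor(max_workers=len(namespaces)) as pool:
        futures = [pool.submit(search, ns, query, per_namespace) for ns in namespaces]
        results = [future.result() for future in futures]

    pooled = []
    for ns, hits in zip(namespaces, results):
        for hit in hits:
            # Tag each hit with its origin so later steps can tell papers from books.
            pooled.append({**hit, "namespace": ns})
    return pooled
```

No sorting happens here on purpose. The pool is handed to the reranker as is, since first-stage scores from two namespaces with different chunk sizes are not obviously comparable.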

This is where naive RAG stops, and where it fails: take the top 8 by cosine similarity, hand them to the LLM, and hope. The LLM cites whatever it receives, even passages with nothing useful to say about the question.

The core problem with single-stage RAG. Embedding similarity is a rough approximation of relevance, not a precise one. Two passages can land near your query for completely different reasons. That’s what the next step fixes.

Two-stage retrieval

This is the pattern I’d point to first when separating RAG systems that work from those that don’t. As of early 2026, retrieve-then-rerank is a common production setup.

Why one stage isn’t enough

When you embed a query and find the nearest chunks in vector space, you’re using a bi-encoder. The query and each passage are embedded separately, and the passages were embedded ahead of time, at ingestion. The comparison is just a distance calculation. Fast. Cheap. Also imprecise.

The imprecision is structural. A 1,024-number vector compresses an entire passage into a single point. Two passages can land near your query for different reasons — one because it answers your question, another because it shares vocabulary. The bi-encoder has no way to tell those apart. It never reads the query and the passage together.
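To make "just a distance calculation" concrete, here is cosine similarity on hand-made three-number vectors. Real vectors have 1,024 dimensions; these toy values are invented purely to show two different passages earning the same score:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = [1.0, 1.0, 0.0]
answers_the_question = [1.0, 0.9, 0.1]   # toy vector for a passage that answers it
shares_vocabulary = [0.9, 1.0, 0.1]      # toy vector for a passage that only sounds similar

score_answer = cosine_similarity(query, answers_the_question)
score_lookalike = cosine_similarity(query, shares_vocabulary)
print(abs(score_answer - score_lookalike) < 1e-12)  # the scores are indistinguishable
```

Nothing in the arithmetic knows why a passage is close. That information was discarded when each text was compressed to a point.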

A cross-encoder does. You hand it the query and one passage as a single input. It reads them jointly — attending to how each word in the query relates to each word in the passage — and outputs a single relevance score. Far more accurate. Far more expensive. You can’t run it across millions of chunks. But 80 candidates? Easy.

  • Stage 1 (broad, cheap): Bi-encoder embeddings cast a wide net and return a pool of plausible matches. Optimizes for recall.
  • Stage 2 (narrow, precise): Cross-encoder reranker reads each candidate alongside the query. Delivers precision.
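The two stages can be sketched as one function with a pluggable second-stage scorer. Everything here is an illustrative assumption: the chunk dictionaries, the function names, and above all token_overlap, which is a crude stand-in for a real cross-encoder such as rerank-2.5 and is only there so the sketch runs offline:

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def token_overlap(query: str, passage: str) -> float:
    """Stand-in scorer, NOT a cross-encoder: counts shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def two_stage_search(query_text: str, query_vec: list[float], chunks: list[dict],
                     cross_score: Callable[[str, str], float] = token_overlap,
                     pool_size: int = 80, keep: int = 8) -> list[dict]:
    """Stage 1 ranks by vector similarity; stage 2 rescores the pool jointly with the query."""
    # Stage 1: cheap and broad. Only precomputed vectors are compared.
    pool = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)[:pool_size]
    # Stage 2: expensive and precise. The scorer sees query and passage text together.
    return sorted(pool, key=lambda c: cross_score(query_text, c["text"]), reverse=True)[:keep]
```

Note that stage 2 can only reorder what stage 1 let through. A passage that misses the pool is gone no matter how well the second scorer would have rated it.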

What the quality jump looks like

Each entry below is a passage. Entries marked (near-miss) were pulled in by the embedding for surface reasons; the rest are actually relevant.

Before reranking (top 8 by cosine similarity): 4 relevant / 4 noise

  • Sweller 1988: worked examples
  • math anxiety (mentions “example”) (near-miss)
  • example-problem pairs
  • software docs: “worked-out solution” (near-miss)
  • faded scaffolding in algebra
  • history of math curricula (near-miss)
  • cognitive load (general overview) (near-miss)
  • expertise reversal effect

After reranking (top 8 by cross-encoder score): 7 relevant / 1 noise

  • Sweller 1988: worked examples
  • example-problem pairs
  • faded scaffolding in algebra
  • expertise reversal effect
  • worked examples in geometry
  • Renkl on self-explanation
  • comparing examples to problem-solving
  • cognitive load (general overview) (near-miss)

The reranker doesn’t invent better passages — they were already in the pool of 80. It reorders. The math anxiety chapter got pulled in by surface similarity (“example” appears in the text); the reranker, having actually read it alongside the query, demotes it. Passages ranked twentieth or fortieth by cosine — but that genuinely answer the question — rise to the top.
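The before and after counts translate directly into precision at 8. In this sketch the relevance labels are my reading of the figure's "4 relevant / 4 noise" and "7 relevant / 1 noise" tallies, so treat the relevant set as an assumption rather than ground truth:

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


relevant = {
    "Sweller 1988", "example-problem pairs", "faded scaffolding in algebra",
    "expertise reversal effect", "worked examples in geometry",
    "Renkl on self-explanation", "comparing examples to problem-solving",
}
before = [
    "Sweller 1988", "math anxiety", "example-problem pairs", "software docs",
    "faded scaffolding in algebra", "history of math curricula",
    "cognitive load overview", "expertise reversal effect",
]
after = [
    "Sweller 1988", "example-problem pairs", "faded scaffolding in algebra",
    "expertise reversal effect", "worked examples in geometry",
    "Renkl on self-explanation", "comparing examples to problem-solving",
    "cognitive load overview",
]

print(precision_at_k(before, relevant, 8))  # 0.5
print(precision_at_k(after, relevant, 8))   # 0.875
```

A small hand-labeled set like this is enough to compare rerankers or chunk sizes against each other without relying on impressions.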

The difference. Whether a RAG system hallucinates plausibly or quotes the right book often comes down to whether you rerank.

Why the first stage still matters

If reranking is better, why not rerank everything? Cost. The cross-encoder reads each passage with the query — running it against 1,500 chunks would take seconds and burn budget. The bi-encoder embeds everything once at ingestion time; at query time it’s one embedding and a fast geometric lookup.

Cheap-and-broad first, expensive-and-precise second. The first stage’s job isn’t to be right — it’s to be inclusive enough that the right answer is in its pool.
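The cost argument is simple arithmetic: a cross-encoder needs one forward pass per candidate. A rough sketch that counts only passes and deliberately assumes nothing about latency or price per pass (the function name is illustrative):

```python
def rerank_cost_ratio(corpus_chunks: int, pool_size: int) -> float:
    """How many times more cross-encoder passes a full-corpus rerank needs than a pooled one."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return corpus_chunks / pool_size


print(rerank_cost_ratio(1_500, 80))      # 18.75 for the current library
print(rerank_cost_ratio(1_000_000, 80))  # 12500.0 for a million-chunk corpus
```

The pooled cost stays fixed at 80 passes per query while the full-corpus cost grows with every document added, which is why the split matters more as a library grows.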

Why chunks have parents

A search hit is a small chunk — 512 or 1,024 tokens, maybe half a page. Enough to score well on relevance, but often not enough to answer. The paragraph saying “worked examples reduce extraneous load” is the right find. The two paragraphs around it — the experimental setup, the implications — are what you actually need to read.

niu-library solves this with a parent-child hierarchy. Each PDF splits into parent chunks (2–4 KB sections, header-aligned) and child chunks (smaller units inside them). Only the children get embedded and searched. Each child carries a parent_id pointing back to its surrounding section.
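A minimal sketch of building that hierarchy from extracted markdown. It makes several simplifying assumptions that are not from the source: parents are cut at every markdown header with no size limits enforced, children are cut by word count rather than tokens, and the id format is invented for illustration:

```python
import re


def build_hierarchy(markdown: str, doc_id: str, child_words: int = 120):
    """Split markdown into header-aligned parents and fixed-size children."""
    parents, children = [], []
    # Zero-width split before each header line keeps the header with its section.
    sections = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    sections = [s.strip() for s in sections if s.strip()]

    for p_idx, section in enumerate(sections):
        parent_id = f"{doc_id}#p{p_idx}"
        parents.append({"id": parent_id, "text": section})

        words = section.split()
        for start in range(0, len(words), child_words):
            children.append({
                "id": f"{parent_id}-c{start // child_words}",
                "parent_id": parent_id,  # pointer back to the surrounding section
                "text": " ".join(words[start:start + child_words]),
            })
    return parents, children
```

Only the children list would be embedded and indexed. The parents go into a plain key-value store keyed by id, since they are fetched by pointer and never searched.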

Child chunk (what got searched and matched), ~500 tokens, high precision, low context:

  “Worked examples reduce extraneous cognitive load by providing complete solutions for learners to study before practice.”

retrieve_parent expands it into:

Parent section (what Claude actually reads), ~2–4 KB, same precision, full context:

  • Before: setup of the cognitive load framework, expertise levels, the problem-solving baseline...
  • Match: “Worked examples reduce extraneous cognitive load by providing complete solutions for learners to study before practice.”
  • After: experimental conditions, measured outcomes, faded scaffolding implications, comparison to direct practice.

Search small, read big. The search index holds small, focused units — that’s what gives embeddings their precision. But the LLM reads the larger surrounding section — that’s what makes answers trustworthy. Parent-child chunking gives you both.

In practice, search_docs returns 8 high-precision hits with parent_id references. A follow-up retrieve_parent call expands any of them into the full surrounding section. Claude picks which ones to expand based on what it needs.
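The expansion step is a lookup by parent_id. The sketch below assumes hits arrive sorted by rerank score and that parents live in a dictionary; the function name expand_hits and the max_parents cap are illustrative, and the real retrieve_parent tool's signature is not described here:

```python
def expand_hits(hits: list[dict], parent_store: dict[str, str],
                max_parents: int = 3) -> list[str]:
    """Return the parent sections for the best hits, each section at most once."""
    seen: set[str] = set()
    sections: list[str] = []
    for hit in hits:  # assumed to be in rerank order, best first
        parent_id = hit["parent_id"]
        if parent_id in seen:
            continue  # two children of the same section should not duplicate it
        seen.add(parent_id)
        sections.append(parent_store[parent_id])
        if len(sections) == max_parents:
            break
    return sections
```

Deduplication matters more than it looks: when several of the top 8 hits come from one section, expanding each of them would fill the context window with copies of the same text.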

What’s inside the corpus

The library currently holds 44 documents and roughly 1,500 vectors — papers and textbooks across six domains (AI/ML, learning design, measurement, education, statistics, other). Domain tags come from Haiku classification at ingestion time, with an S3 content fallback for documents whose metadata is too thin. You can filter searches by domain to scope your question.

The ingestion pipeline runs automatically when a PDF lands in S3:

  1. Classify — paper or book, by page count
  2. Extract — structured markdown via pymupdf4llm; strip references and indexes that would poison search
  3. Enrich — metadata via Haiku + Semantic Scholar lookups
  4. Embed — dense (Voyage voyage-4) + sparse (Pinecone’s model)
  5. Upload — to the matching namespace
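Steps 1 and 2 above can be sketched as two small pure functions. The 50-page threshold and the chunk sizes come from the namespace description earlier; the heading pattern used to find back matter and the function names are assumptions, and the real pipeline may detect references differently:

```python
import re

PAPER_MAX_PAGES = 50                          # papers are "under 50 pages"
CHUNK_TOKENS = {"papers": 512, "books": 1024}  # per-namespace chunk sizes

# Assumed convention: back matter starts at a header that is exactly one of these words.
BACK_MATTER = re.compile(
    r"^#{1,6}\s*(references|bibliography|index)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def classify(page_count: int) -> str:
    """Pick the namespace for a document from its page count."""
    return "papers" if page_count < PAPER_MAX_PAGES else "books"


def strip_back_matter(markdown: str) -> str:
    """Drop everything from the first references/bibliography/index header onward."""
    match = BACK_MATTER.search(markdown)
    return markdown[:match.start()].rstrip() if match else markdown
```

This version cuts at the first matching header, so an appendix that follows the reference list would be lost too. That trade-off may be acceptable for papers and wrong for textbooks with per-chapter references.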

Four Lambda functions handle the whole flow — MCP server, OAuth, authorizer, vectorizer — all deployed by SAM.
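The entry point of a vectorizer like this is an S3 event. A sketch of parsing one, using the standard S3 notification shape (Records, then s3.bucket.name and s3.object.key, with keys URL-encoded). The handler body, the return value, and the example bucket name are placeholders, not the project's code:

```python
from urllib.parse import unquote_plus


def pdf_objects(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs for uploaded PDFs from an S3 notification event."""
    targets = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in events; spaces arrive as '+'.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            targets.append((bucket, key))
    return targets


def handler(event: dict, context) -> dict:
    """Lambda entry point: list what would be ingested."""
    targets = pdf_objects(event)
    # A real vectorizer would download each object and run the five steps here.
    return {"queued": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

Decoding the key is the detail that tends to bite first: a file named with spaces uploads fine and then fails at download time if the raw event key is used.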

The stack, annotated

| Layer | Service | What it does | Free tier? |
| --- | --- | --- | --- |
| Runtime | AWS Lambda + API Gateway | Runs code on demand, no servers to manage | Yes (1M requests/mo) |
| Storage | S3 | Holds source PDFs; triggers vectorize on upload | Yes (5 GB) |
| Interface | MCP | One protocol any AI tool can plug into | Open standard |
| Auth | Lambda Authorizer + DynamoDB | JWT (OAuth 2.1) or x-api-key; resolves identity | Included |
| Extraction | pymupdf4llm | PDF → structured markdown, preserves headers | Open source |
| Metadata | Claude Haiku | Reads first pages, returns title/author/year/domain | Pay-per-use |
| Dense embedding | Voyage voyage-4 | Text → 1,024 numbers representing meaning | Free tier available |
| Sparse embedding | Pinecone pinecone-sparse-english-v0 | Smart keyword signal for hybrid search | Included with Pinecone |
| Vector store | Pinecone | Stores and searches by meaning | Yes (100K vectors) |
| Reranking | Voyage rerank-2.5 | Cross-encoder relevance scoring on top candidates | Free tier available |
| Deploy | AWS SAM | One command to package and ship all four Lambdas | Free (tooling) |

Why build this yourself?

You can pay for a hosted RAG product. Some are good. But the seams are hidden — chunking strategy, embedding choice, hybrid scoring, two-stage retrieval, parent expansion — all wrapped in a box that returns “the answer.”

Build it yourself and the seams are visible. Swap the reranker, change the chunk size, try a different embedding model, and watch the quality change. You learn what good even means.

That’s the ethos behind every project on this site: build the real thing, then show how every layer works so anyone can follow along.