Project write-up
Semantic Search Engine
A search system built to retrieve by meaning instead of exact keyword matches.
I built this project out of a simple frustration: keyword search works well only when the query uses the same vocabulary as the document. The moment the wording shifts, results become noisy, brittle, or flat-out wrong.
I wanted something closer to intent matching. If a note talks about “vector representations” and the query asks about “embeddings”, the system should still understand that these ideas are adjacent enough to retrieve the right passage.
What I was building
The project was a small semantic retrieval system with three parts:
- a document preparation pipeline
- an embedding-based candidate retrieval step
- a ranking pass to improve the final ordering
The goal was not to compete with a production search engine. It was to understand the retrieval stack end to end and build something I could reason about, evaluate, and improve without hiding behind abstraction.
Why semantic search
Traditional search is excellent at precision when the vocabulary overlaps exactly. It struggles when meaning is shared but phrasing is different.
A few cases kept showing up:
- queries asked in natural language while documents used technical shorthand
- long notes where the relevant paragraph was buried under related but broader sections
- concept-level searches where synonyms mattered more than literal word match
Semantic search is useful here because it represents both queries and documents in the same vector space. Instead of asking “do these words overlap,” it asks “do these pieces of text mean something similar.”
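To make the "same vector space" idea concrete, here is a toy cosine-similarity check. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": nearby meanings get nearby vectors.
query_vec = [0.9, 0.1, 0.2]   # e.g. "embeddings"
doc_vec   = [0.8, 0.2, 0.3]   # e.g. "vector representations"
other_vec = [0.1, 0.9, 0.1]   # unrelated text

assert cosine_similarity(query_vec, doc_vec) > cosine_similarity(query_vec, other_vec)
```

The word-overlap question never comes up: only the geometry of the vectors matters.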
Document preparation
The first thing I learned was that retrieval quality depends heavily on chunking. Bad chunks poison the whole system.
At first I indexed full documents. That made retrieval too coarse. A single match on a broad article would outrank a much more focused note because the article contained many related terms. The engine was technically finding the right document, but not the right answer surface.
So I moved to chunk-level indexing:
- long documents were split into smaller overlapping windows
- headings were preserved as metadata
- chunk boundaries tried to respect paragraph structure instead of cutting blindly by character count
That small change improved usefulness more than the early model experiments. Search quality felt better because the engine was now allowed to return the exact section that matched the query instead of a bulky parent document.
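A minimal sketch of that chunking strategy, assuming paragraph-separated plain text. The `max_chars` and `overlap` values here are illustrative defaults, not the ones the project actually used:

```python
def chunk_document(text, heading="", max_chars=500, overlap=1):
    """Split a document into overlapping chunks along paragraph boundaries.

    Paragraphs are greedily packed into chunks of at most `max_chars`;
    `overlap` is how many trailing paragraphs are repeated at the start
    of the next chunk so context is not cut mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, window, size = [], [], 0
    for para in paragraphs:
        if window and size + len(para) > max_chars:
            chunks.append({"heading": heading, "text": "\n\n".join(window)})
            window = window[-overlap:]              # carry overlap forward
            size = sum(len(p) for p in window)
        window.append(para)
        size += len(para)
    if window:
        chunks.append({"heading": heading, "text": "\n\n".join(window)})
    return chunks
```

Keeping the heading as metadata means every chunk still knows which section it came from, which pays off later in ranking.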
Embeddings and retrieval
For retrieval, both documents and queries were embedded into dense vectors. Similarity was then measured with nearest-neighbor search.
The basic pipeline looked like this:
- clean and chunk documents
- generate embeddings for each chunk
- store vectors with chunk metadata
- embed the incoming query
- retrieve top-k nearest chunks
- pass those candidates through a ranking layer
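The first five steps can be sketched end to end with a toy bag-of-words embedder standing in for a real embedding model; the `VOCAB` list and `embed` function are placeholders for illustration only:

```python
import math

# Tiny fixed vocabulary so the sketch is deterministic; a real system
# would replace embed() with a sentence-embedding model.
VOCAB = ["embeddings", "text", "vectors", "chunking", "documents",
         "ranking", "candidates", "map", "splits", "reorders"]

def embed(text):
    """Toy bag-of-words embedding, L2-normalized."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(query, index, k=3):
    """Brute-force nearest-neighbor search over (chunk, vector) pairs."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Chunk, embed, store, embed the query, retrieve top-k.
chunks = ["embeddings map text to vectors",
          "chunking splits long documents",
          "ranking reorders retrieved candidates"]
index = [(chunk, embed(chunk)) for chunk in chunks]
results = top_k("how do embeddings represent text", index, k=2)
```

Brute-force scoring is fine at this scale; an approximate nearest-neighbor index only becomes necessary once the chunk count grows large.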
The first version stopped at step five. It was enough to demonstrate the core idea, but it still had a common failure mode: results were semantically related, yet not always ranked in the order a person would expect.
That is the gap between “roughly relevant” and “actually useful.”
Ranking pipeline
The ranking stage was the real turning point.
Dense retrieval is good at recalling relevant candidates, but the top few positions can still be messy. I found that queries with broad conceptual wording often returned chunks that were on-topic without being the best answer.
So I treated embeddings as a recall layer, not the final decision-maker.
The ranking pass used extra signals such as:
- semantic similarity score
- lexical overlap on important terms
- heading relevance
- chunk length penalties for overly broad matches
The combined effect was more stable ordering. Instead of one very broad chunk dominating the top spot, the ranking layer promoted sections that were both semantically aligned and locally specific.
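One way to blend those signals is a simple linear score. The weights and the length-penalty threshold below are illustrative guesses, not tuned values from the project:

```python
def rerank(query, candidates, weights=(0.6, 0.25, 0.1, 0.05)):
    """Re-score retrieved candidates with a hand-tuned linear blend.

    Each candidate is a dict with 'text', 'heading', and the dense
    similarity 'sim' from the retrieval step.
    """
    w_sim, w_lex, w_head, w_len = weights
    q_terms = set(query.lower().split())

    def score(cand):
        terms = set(cand["text"].lower().split())
        lexical = len(q_terms & terms) / max(len(q_terms), 1)
        heading = len(q_terms & set(cand["heading"].lower().split())) / max(len(q_terms), 1)
        # Penalize very long chunks: broad sections match everything weakly.
        length_penalty = min(len(cand["text"]) / 2000.0, 1.0)
        return (w_sim * cand["sim"] + w_lex * lexical
                + w_head * heading - w_len * length_penalty)

    return sorted(candidates, key=score, reverse=True)
```

With this blend, a focused chunk with strong lexical and heading overlap can outrank a broader chunk whose dense similarity alone is slightly higher.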
That made the results feel less magical and more trustworthy.
Evaluation
I did not want to rely only on anecdotal demos, so I assembled a small set of test queries and judged the returned results manually.
The evaluation questions were simple:
- does the correct chunk appear in the top-k results?
- how often is the best answer in the first few positions?
- what kinds of queries consistently confuse the system?
I kept notes on failure patterns instead of obsessing over a single metric. For a project like this, error taxonomy taught me more than headline numbers.
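The first two questions map onto standard metrics, recall@k and mean reciprocal rank. A small sketch, assuming each query has exactly one manually labeled relevant chunk:

```python
def evaluate(run, k=5):
    """Compute recall@k and MRR over a fixed query set.

    `run` maps each query to (ranked_chunk_ids, relevant_chunk_id),
    mirroring a manually judged query set.
    """
    hits, reciprocal_ranks = 0, []
    for query, (ranked, relevant) in run.items():
        if relevant in ranked[:k]:
            hits += 1
        rank = ranked.index(relevant) + 1 if relevant in ranked else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    recall_at_k = hits / len(run)
    mrr = sum(reciprocal_ranks) / len(run)
    return recall_at_k, mrr
```

Replaying the same `run` after every change makes improvements comparable instead of anecdotal.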
The recurring failure cases were:
- vague one-word queries with weak context
- queries that mixed multiple intents
- highly domain-specific language the embedding model had seen little of during training
The ranking layer improved precision near the top, but evaluation also showed that no amount of reranking can fix poor chunking or weak source material.
What failed first
Several early assumptions turned out to be wrong.
Full-document indexing
Too blunt. It retrieved relevant sources but not the best passages.
Pure vector similarity
Good for recall, weak for final ordering. The system often found the neighborhood of the answer without putting the answer itself first.
Aggressive chunk splitting
Chunks that were too small lost context. The model would match fragments well, but users do not read fragments in isolation. Below a certain chunk size, match precision keeps rising while usefulness falls.
Treating evaluation as optional
Without a fixed query set, every improvement looked convincing. Once I started replaying the same queries, many “wins” turned out to be luck.
What I learned
The project changed how I think about search systems.
The hardest part is not generating vectors. It is deciding what unit of information should be searchable, what trade-offs matter for the user, and how to measure whether the result is genuinely better.
I also came away with a useful mental model:
- embeddings are strong at recall
- ranking is where product quality becomes visible
- chunking is the hidden lever that affects everything downstream
In other words, retrieval quality is mostly systems design, not just model selection.
If I continue this project
The next improvements would be practical rather than flashy:
- build a cleaner labeled evaluation set
- compare multiple embedding models on the same queries
- add lightweight filters for document type or date
- generate search result explanations to make ranking decisions easier to inspect
I would also like to package it as a small local-first search tool for personal notes, essays, and references. That feels like the right scale for this kind of system: useful, inspectable, and fast enough to run without ceremony.
Closing note
This project was less about shipping a polished search product and more about understanding why modern retrieval systems behave the way they do.
It gave me a much better appreciation for the boundary between “the model understands the query” and “the user actually got the answer they wanted.” That boundary is where most of the interesting engineering lives.