Project write-up
Semantic Search Engine
A search system built to retrieve by meaning instead of exact keyword matches.
I built this project out of a simple frustration: keyword search works well only when the query uses the same vocabulary as the document. The moment the wording shifts, results become noisy, brittle, or flat-out wrong.
I wanted something closer to intent matching. If a note talks about “vector representations” and the query asks about “embeddings”, the system should still understand that these ideas are adjacent enough to retrieve the right passage.
What I was building
The project was a small semantic retrieval system with three parts:
- a document preparation pipeline
- an embedding-based candidate retrieval step
- a ranking pass to improve the final ordering
The goal was not to compete with a production search engine. It was to understand the retrieval stack end to end and build something I could reason about, evaluate, and improve without hiding behind abstraction.
Why semantic search
Traditional search is excellent at precision when the vocabulary overlaps exactly. It struggles when meaning is shared but phrasing is different.
A few cases kept showing up:
- queries asked in natural language while documents used technical shorthand
- long notes where the relevant paragraph was buried under related but broader sections
- concept-level searches where synonyms mattered more than literal word match
Semantic search is useful here because it represents both queries and documents in the same vector space. Instead of asking “do these words overlap,” it asks “do these pieces of text mean something similar.”
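To make the "same vector space" idea concrete, here is a toy cosine-similarity check. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": nearby meanings get nearby vectors.
query_vec = [0.9, 0.1, 0.2]   # e.g. "embeddings"
doc_vec   = [0.8, 0.2, 0.3]   # e.g. "vector representations"
other_vec = [0.1, 0.9, 0.1]   # unrelated text

assert cosine_similarity(query_vec, doc_vec) > cosine_similarity(query_vec, other_vec)
```

The word-overlap question never comes up: only the geometry of the vectors matters.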
Document preparation
The first thing I learned was that retrieval quality depends heavily on chunking. Bad chunks poison the whole system.
At first I indexed full documents. That made retrieval too coarse. A single match on a broad article would outrank a much more focused note because the article contained many related terms. The engine was technically finding the right document, but not the right answer surface.
So I moved to chunk-level indexing:
- long documents were split into smaller overlapping windows
- headings were preserved as metadata
- chunk boundaries tried to respect paragraph structure instead of cutting blindly by character count
That small change improved usefulness more than the early model experiments. Search quality felt better because the engine was now allowed to return the exact section that matched the query instead of a bulky parent document.
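A minimal sketch of that chunking strategy, assuming paragraph-separated plain text. The `max_chars` and `overlap` values here are illustrative defaults, not the ones the project actually used:

```python
def chunk_document(text, heading="", max_chars=500, overlap=1):
    """Split a document into overlapping chunks along paragraph boundaries.

    Paragraphs are greedily packed into chunks of at most `max_chars`;
    `overlap` is how many trailing paragraphs are repeated at the start
    of the next chunk so context is not cut mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, window, size = [], [], 0
    for para in paragraphs:
        if window and size + len(para) > max_chars:
            chunks.append({"heading": heading, "text": "\n\n".join(window)})
            window = window[-overlap:]              # carry overlap forward
            size = sum(len(p) for p in window)
        window.append(para)
        size += len(para)
    if window:
        chunks.append({"heading": heading, "text": "\n\n".join(window)})
    return chunks
```

Keeping the heading as metadata means every chunk still knows which section it came from, which pays off later in ranking.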
Embeddings and retrieval
For retrieval, both documents and queries were embedded into dense vectors. Similarity was then measured with nearest-neighbor search.
The basic pipeline looked like this:
- clean and chunk documents
- generate embeddings for each chunk
- store vectors with chunk metadata
- embed the incoming query
- retrieve top-k nearest chunks
- pass those candidates through a ranking layer
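The first five steps can be sketched end to end with a toy bag-of-words embedder standing in for a real embedding model; the `VOCAB` list and `embed` function are placeholders for illustration only:

```python
import math

# Tiny fixed vocabulary so the sketch is deterministic; a real system
# would replace embed() with a sentence-embedding model.
VOCAB = ["embeddings", "text", "vectors", "chunking", "documents",
         "ranking", "candidates", "map", "splits", "reorders"]

def embed(text):
    """Toy bag-of-words embedding, L2-normalized."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(query, index, k=3):
    """Brute-force nearest-neighbor search over (chunk, vector) pairs."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Chunk, embed, store, embed the query, retrieve top-k.
chunks = ["embeddings map text to vectors",
          "chunking splits long documents",
          "ranking reorders retrieved candidates"]
index = [(chunk, embed(chunk)) for chunk in chunks]
results = top_k("how do embeddings represent text", index, k=2)
```

Brute-force scoring is fine at this scale; an approximate nearest-neighbor index only becomes necessary once the chunk count grows large.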
The first version stopped at step five. It was enough to demonstrate the core idea, but it still had a common failure mode: results were semantically related, yet not always ranked in the order a person would expect.
That is the gap between “roughly relevant” and “actually useful.”
Ranking pipeline
The ranking stage was the real turning point.
Dense retrieval is good at recalling relevant candidates, but the top few positions can still be messy. I found that queries with broad conceptual wording often returned chunks that were on-topic without being the best answer.
So I treated embeddings as a recall layer, not the final decision-maker.
The ranking pass used extra signals such as:
- semantic similarity score
- lexical overlap on important terms
- heading relevance
- chunk length penalties for overly broad matches
The combined effect was more stable ordering. Instead of one very broad chunk dominating the top spot, the ranking layer promoted sections that were both semantically aligned and locally specific.
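One way to blend those signals is a simple linear score. The weights and the length-penalty threshold below are illustrative guesses, not tuned values from the project:

```python
def rerank(query, candidates, weights=(0.6, 0.25, 0.1, 0.05)):
    """Re-score retrieved candidates with a hand-tuned linear blend.

    Each candidate is a dict with 'text', 'heading', and the dense
    similarity 'sim' from the retrieval step.
    """
    w_sim, w_lex, w_head, w_len = weights
    q_terms = set(query.lower().split())

    def score(cand):
        terms = set(cand["text"].lower().split())
        lexical = len(q_terms & terms) / max(len(q_terms), 1)
        heading = len(q_terms & set(cand["heading"].lower().split())) / max(len(q_terms), 1)
        # Penalize very long chunks: broad sections match everything weakly.
        length_penalty = min(len(cand["text"]) / 2000.0, 1.0)
        return (w_sim * cand["sim"] + w_lex * lexical
                + w_head * heading - w_len * length_penalty)

    return sorted(candidates, key=score, reverse=True)
```

With this blend, a focused chunk with strong lexical and heading overlap can outrank a broader chunk whose dense similarity alone is slightly higher.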
That made the results feel less magical and more trustworthy.
Evaluation
I did not want to rely only on anecdotal demos, so I assembled a small set of test queries and judged the returned results manually.
The evaluation questions were simple:
- does the correct chunk appear in the top-k results?
- how often is the best answer in the first few positions?
- what kinds of queries consistently confuse the system?
I kept notes on failure patterns instead of obsessing over a single metric. For a project like this, error taxonomy taught me more than headline numbers.
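The first two questions map onto standard metrics, recall@k and mean reciprocal rank. A small sketch, assuming each query has exactly one manually labeled relevant chunk:

```python
def evaluate(run, k=5):
    """Compute recall@k and MRR over a fixed query set.

    `run` maps each query to (ranked_chunk_ids, relevant_chunk_id),
    mirroring a manually judged query set.
    """
    hits, reciprocal_ranks = 0, []
    for query, (ranked, relevant) in run.items():
        if relevant in ranked[:k]:
            hits += 1
        rank = ranked.index(relevant) + 1 if relevant in ranked else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    recall_at_k = hits / len(run)
    mrr = sum(reciprocal_ranks) / len(run)
    return recall_at_k, mrr
```

Replaying the same `run` after every change makes improvements comparable instead of anecdotal.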
The recurring failure cases were:
- vague one-word queries with weak context
- queries that mixed multiple intents
- highly domain-specific language the embedding model had seen little of during training
The ranking layer improved precision near the top, but evaluation also showed that no amount of reranking can fix poor chunking or weak source material.
What failed first
Several early assumptions turned out to be wrong.
Full-document indexing
Too blunt. It retrieved relevant sources but not the best passages.
Pure vector similarity
Good for recall, weak for final ordering. The system often found the neighborhood of the answer without putting the answer itself first.
Aggressive chunk splitting
Chunks that were too small lost context. The model would match fragments well, but users do not read fragments in isolation. Below a certain chunk size, match precision keeps rising while usefulness falls.
Treating evaluation as optional
Without a fixed query set, every improvement looked convincing. Once I started replaying the same queries, many “wins” turned out to be luck.
What I learned
The project changed how I think about search systems.
The hardest part is not generating vectors. It is deciding what unit of information should be searchable, what trade-offs matter for the user, and how to measure whether the result is genuinely better.
I also came away with a useful mental model:
- embeddings are strong at recall
- ranking is where product quality becomes visible
- chunking is the hidden lever that affects everything downstream
In other words, retrieval quality is mostly systems design, not just model selection.
If I continue this project
The next improvements would be practical rather than flashy:
- build a cleaner labeled evaluation set
- compare multiple embedding models on the same queries
- add lightweight filters for document type or date
- generate search result explanations to make ranking decisions easier to inspect
I would also like to package it as a small local-first search tool for personal notes, essays, and references. That feels like the right scale for this kind of system: useful, inspectable, and fast enough to run without ceremony.
Closing note
This project was less about shipping a polished search product and more about understanding why modern retrieval systems behave the way they do.
It gave me a much better appreciation for the boundary between “the model understands the query” and “the user actually got the answer they wanted.” That boundary is where most of the interesting engineering lives.