Natural Language Processing Systems • Semantic Search (Dense Embeddings, ANN) · Easy · ⏱️ ~3 min
What is Semantic Search and How Do Dense Embeddings Work?
Semantic search matches content based on meaning rather than exact keyword overlap. It solves a fundamental problem: a user searching for "fix flaky tests in CI" should find documents about "reduce nondeterministic build failures" even though they share no common words. This works by converting both queries and documents into dense embeddings, which are numerical vectors in a continuous multidimensional space.
An embedding is simply a list of numbers, typically 128 to 1024 of them, though production systems commonly use 256, 384, or 768 dimensions. Think of each dimension as a learned feature that captures some aspect of meaning. Items with similar semantic content end up positioned near each other in this vector space. For example, Google might represent "machine learning" as a 384-dimensional vector like [0.23, -0.15, 0.87, ...], and "artificial intelligence" would have a very similar vector because the concepts are related.
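For a concrete feel, here is a minimal sketch of producing such vectors with the open-source sentence-transformers library. The all-MiniLM-L6-v2 checkpoint is an assumption, chosen only because it outputs 384-dimensional vectors; any sentence encoder follows the same encode-to-vector workflow.

```python
# Minimal sketch: turning text into dense embeddings.
# Assumes the sentence-transformers package and the all-MiniLM-L6-v2
# checkpoint (chosen because it outputs 384-dimensional vectors).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "fix flaky tests in CI"
doc = "reduce nondeterministic build failures"

query_vec = model.encode(query)  # NumPy array of shape (384,)
doc_vec = model.encode(doc)      # NumPy array of shape (384,)

print(query_vec.shape)  # (384,)
print(query_vec[:3])    # first three components; exact values depend on the model
```

Even though the query and document share no words, their encoded vectors end up close together, which is exactly what the distance metrics below measure.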
To measure similarity, systems use distance metrics. Cosine similarity is the most common choice, measuring the angle between two vectors regardless of their magnitude. Inner product (dot product) is also widely used, especially when vectors are normalized to unit length, which makes it equivalent to cosine. L2 distance (Euclidean distance) measures straight-line distance and is used in algorithms that assume geometric properties of the space. A critical implementation detail: if you use cosine similarity, normalize all vectors to unit length during indexing, or use an index that supports cosine natively.
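The three metrics can be sketched in a few lines of NumPy; the vectors here are toy values rather than real embeddings.

```python
# Sketch of the three similarity measures discussed above, using plain NumPy.
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so that dot product == cosine similarity."""
    return v / np.linalg.norm(v)

a = np.array([0.23, -0.15, 0.87])
b = np.array([0.20, -0.10, 0.90])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based
dot = np.dot(l2_normalize(a), l2_normalize(b))                   # equals cosine on unit vectors
euclidean = np.linalg.norm(a - b)                                # straight-line (L2) distance

print(f"cosine={cosine:.4f} dot(normalized)={dot:.4f} l2={euclidean:.4f}")
```

Normalizing once at indexing time is the usual trade: it turns every query-time cosine computation into a cheaper dot product.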
Embedding quality is the single biggest factor in retrieval quality. You can start with general-purpose sentence encoders as a baseline, but domain-specific fine-tuning typically improves relevance metrics like Normalized Discounted Cumulative Gain (NDCG) or Mean Reciprocal Rank (MRR) by 5 to 20 percent. Companies like Google and Meta invest heavily in training domain-specific embedding models on billions of examples from search logs and user interactions.
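To make those relevance metrics concrete, here is a toy sketch of computing MRR and NDCG@k for a single query; the graded relevance labels are invented purely for illustration.

```python
# Toy evaluation sketch: MRR and NDCG@k over one query's ranked results.
# The relevance judgments below are invented purely for illustration.
import math

def reciprocal_rank(relevances: list[int]) -> float:
    """1 / rank of the first relevant result, or 0 if nothing is relevant."""
    for i, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain at k, normalized by the ideal ordering."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the top-5 results for one query (0 = irrelevant).
ranked_rels = [0, 2, 1, 0, 3]
print(f"MRR contribution: {reciprocal_rank(ranked_rels):.3f}")  # 0.500
print(f"NDCG@5:           {ndcg_at_k(ranked_rels, 5):.3f}")
```

Averaging these per-query scores over a labeled query set is how the 5 to 20 percent fine-tuning gains mentioned above would actually be measured.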
💡 Key Takeaways
• Dense embeddings convert text into numerical vectors (typically 256 to 768 dimensions) where semantically similar items are positioned close together in vector space
• Cosine similarity is the most common distance metric in production, measuring the angle between vectors; normalize to unit length or use cosine-aware indices
• Embedding quality matters more than algorithm choice: domain fine-tuning improves NDCG by 5 to 20 percent over general sentence encoders
• Distance thresholds act as quality gates: if the nearest neighbor's distance exceeds a threshold, return no results or fall back to keyword search to avoid irrelevant matches (see the sketch after this list)
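The quality-gate idea can be sketched as a thin wrapper around retrieval. Here, `vector_search`, `keyword_search`, and the 0.35 distance threshold are hypothetical placeholders you would swap for your own backends and tune on labeled queries.

```python
# Sketch of a distance-threshold "quality gate" in front of vector search.
# `vector_search` and `keyword_search` are hypothetical stand-ins for your
# retrieval backends; the 0.35 threshold is an illustrative value to tune.
from typing import Callable

COSINE_DISTANCE_THRESHOLD = 0.35  # 1 - cosine similarity; tune per corpus

def search_with_gate(query: str,
                     vector_search: Callable[[str], list[tuple[str, float]]],
                     keyword_search: Callable[[str], list[str]]) -> list[str]:
    """Return vector hits only when the best match clears the distance gate."""
    hits = vector_search(query)  # list of (doc_id, cosine_distance), best first
    if hits and hits[0][1] <= COSINE_DISTANCE_THRESHOLD:
        return [doc_id for doc_id, dist in hits if dist <= COSINE_DISTANCE_THRESHOLD]
    # Nearest neighbor is too far away: fall back rather than return noise.
    return keyword_search(query)
```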
📌 Examples
Google uses 384-dimensional embeddings for semantic search, enabling queries like "CEO of Tesla" to match documents containing "Elon Musk" without shared keywords
Meta trains embedding models on billions of user interactions, with 768-dimensional vectors for content understanding across Facebook and Instagram feeds
Production normalization: Store vectors as unit length during indexing by dividing each vector by its L2 norm, making cosine similarity equivalent to the faster dot product
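A sketch of that normalization step, assuming the corpus embeddings already live in an (N, d) NumPy matrix: normalizing rows once at indexing time lets query-time scoring be a single matrix-vector product.

```python
# Sketch of the normalization trick above: store unit-length vectors at
# indexing time so query-time scoring is one matrix-vector dot product.
# The random vectors stand in for real embeddings.
import numpy as np

def normalize_rows(mat: np.ndarray) -> np.ndarray:
    """Divide each row by its L2 norm (unit length => dot product == cosine)."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.clip(norms, 1e-12, None)  # guard against zero vectors

rng = np.random.default_rng(0)
corpus_vecs = normalize_rows(rng.standard_normal((1000, 384)).astype(np.float32))
query_vec = normalize_rows(rng.standard_normal((1, 384)).astype(np.float32))[0]

scores = corpus_vecs @ query_vec   # cosine similarities, computed as one matmul
top_k = np.argsort(-scores)[:5]    # indices of the 5 closest documents
print(top_k, scores[top_k])
```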