Production Architecture: Ingestion, Indexing, and Query Serving
A production semantic search system has four main paths: offline ingestion, index building, online retrieval, and relevance feedback. Understanding how these fit together and their scale characteristics is critical for system design interviews.
Offline ingestion starts with raw documents. Long documents are chunked into manageable units, typically 200 to 400 tokens per chunk with 50 token overlap to prevent context boundaries from hurting recall. Each chunk is embedded into a float vector using a trained encoder model. With distributed executors, you can achieve embedding throughput of 1 billion items per hour using around 300 commodity CPU workers. This sets expectations: a catalog of 500 million documents can be fully reprocessed in an overnight batch window. The output is stored with metadata like document ID, chunk offset, language, tenant ID, and access control tags for filtering.
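The chunking step can be sketched in a few lines. This is a minimal sketch using the chunk-size and overlap numbers from the text; tokenization is assumed to have happened upstream, and the function name is illustrative, not from any particular library:

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into overlapping chunks.

    Consecutive chunks share `overlap` tokens so that a passage falling
    on a chunk boundary still appears whole in at least one chunk.
    The stride between chunk starts is chunk_size - overlap.
    """
    stride = chunk_size - overlap
    chunks = []
    # Stop once the remaining tail is fully covered by the previous chunk.
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append((start, tokens[start:start + chunk_size]))
    return chunks
```

Each chunk is stored with its start offset, which is what later enables snippet extraction at query time.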
Index building is where you choose and train your ANN structure. For IVF style indices, you run k means clustering on a sample (commonly 1 to 10 percent of the corpus, stratified by language and popularity) to learn K centroids, where K typically ranges from 4,096 to 65,536 depending on corpus size. Product Quantization then learns codebooks to compress each vector. For example, 200 million vectors at 768 dimensions in float32 would be 614 GB raw. With 32 byte PQ codes and moderate index overhead, the compressed index fits in roughly 10 to 20 GB per shard. Real systems report about 15 GB for 200 million vectors: the PQ codes themselves compress each vector roughly 100 to 1 (32 bytes versus 3,072 bytes of float32), and document IDs plus inverted list overhead make up the rest. HNSW does not require a training phase and supports online inserts well, but uses more memory.
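The sizing claims above reduce to simple arithmetic, which is worth being able to do on the spot in an interview. A quick back-of-envelope check using the numbers from the text:

```python
# Back-of-envelope sizing check for the numbers quoted above.
n_vectors = 200_000_000          # 200 million vectors
dim = 768                        # embedding dimensionality
bytes_per_float = 4              # float32
pq_code_bytes_per_vec = 32       # 32-byte PQ codes

raw_gb = n_vectors * dim * bytes_per_float / 1e9    # raw float32 storage
codes_gb = n_vectors * pq_code_bytes_per_vec / 1e9  # PQ codes alone

# raw_gb is 614.4 GB; codes_gb is 6.4 GB. Document IDs, centroids, and
# inverted-list overhead bring the shard to the reported 10-20 GB range.
```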
Online retrieval begins when a user query arrives. The query is embedded (budget: 5 milliseconds on CPU with a small encoder), then the vector service performs ANN search. For 1 to 10 million vectors per shard on CPU, systems routinely achieve 2 to 10 milliseconds at p95 for the vector lookup. At 100 million scale per shard with IVF PQ or ScaNN, lookup stays under 20 milliseconds at p95 with 90 to 95 percent recall at 10. After ANN, a cross encoder reranker can rescore the top 50 to 200 candidates, adding 20 to 40 milliseconds but improving precision at 10 by 5 to 15 percent. The total end to end budget is typically 50 to 150 milliseconds at p95. A single node can handle a few thousand queries per second (QPS) if ANN is CPU bound.
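The retrieve-then-rerank path can be sketched as below. The `encoder`, `ann_index`, and `reranker` interfaces are hypothetical stand-ins for whatever serving stack is in use, not names from a specific library; the latency figures in the comments come from the budget above:

```python
def search(query, encoder, ann_index, reranker,
           ann_top_n=1000, rerank_k=100, final_k=10):
    """Two-stage retrieval: cheap ANN for recall, then an expensive
    cross-encoder pass over a small candidate set for precision.

    encoder, ann_index, and reranker are assumed interfaces:
      encoder.embed(query) -> query vector          (~5 ms on CPU)
      ann_index.search(vec, n) -> candidate list    (~2-20 ms p95)
      reranker.score(query, texts) -> score list    (~20-40 ms for 50-200 pairs)
    """
    q_vec = encoder.embed(query)
    candidates = ann_index.search(q_vec, ann_top_n)
    top = candidates[:rerank_k]
    scores = reranker.score(query, [c.text for c in top])
    # Re-sort the candidate set by cross-encoder score, highest first.
    reranked = sorted(zip(top, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in reranked[:final_k]]
```

Keeping `rerank_k` in the 50 to 200 range is what holds the reranking stage inside the 20 to 40 millisecond slice of the overall budget.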
Hybrid retrieval is widely deployed. Systems like Elasticsearch combine BM25 keyword search with HNSW dense retrieval in the same cluster. You can enforce must have keywords using a lexical prefilter, then run ANN on the filtered candidate set. Or retrieve top K from both BM25 and dense ANN, then blend scores using a learned linear model or isotonic regression on click data. Many companies report that blending lexical and semantic signals improves relevance over either alone. Microsoft Bing and Google both use hybrid retrieval with dense vectors plus reranking for web search and site search.
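The score-blending variant can be sketched as follows. The min-max normalization and the fixed weights are illustrative assumptions; as the text notes, production systems typically learn the blend (linear model or isotonic regression) from click data:

```python
def blend_scores(bm25_scores, dense_scores, w_lexical=0.4, w_dense=0.6):
    """Blend BM25 and dense retrieval scores for documents retrieved by either.

    Each input is a dict mapping doc_id -> raw score. Because BM25 and
    cosine/dot-product scores live on different scales, each list is
    min-max normalized to [0, 1] before the weighted sum. Weights here
    are illustrative placeholders, not tuned values.
    """
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    nb, nd = normalize(bm25_scores), normalize(dense_scores)
    docs = set(nb) | set(nd)  # union: a doc may appear in only one list
    blended = {d: w_lexical * nb.get(d, 0.0) + w_dense * nd.get(d, 0.0)
               for d in docs}
    return sorted(blended.items(), key=lambda p: p[1], reverse=True)
```

A document missing from one retriever's top K simply contributes zero for that signal, which is the common simple choice; some systems instead backfill the missing score before blending.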
💡 Key Takeaways
•Offline ingestion can embed 1 billion items per hour with 300 CPU workers, enabling overnight refresh of 500 million document catalogs; chunk documents to 200 to 400 tokens with overlap
•Index build for IVF trains k means on 1 to 10 percent sample to learn 4,096 to 65,536 centroids; 200 million 768d vectors compress from 614 GB raw to 10 to 20 GB with PQ
•Online retrieval budget: 5 milliseconds query embedding, 10 to 20 milliseconds ANN at 100 million scale, 20 to 40 milliseconds cross encoder rerank, total 50 to 150 milliseconds p95
•Hybrid retrieval blends BM25 keyword and dense semantic scores, widely used at Google and Microsoft Bing; Elasticsearch supports HNSW plus keyword filters in single cluster
•Single node throughput: few thousand QPS for CPU bound ANN; GPU can handle tens of thousands QPS for batched brute force or reranking
📌 Examples
Google embeds billions of web documents overnight using distributed CPU clusters, with per shard indices of 100 to 200 million vectors using quantization
Elasticsearch deploys HNSW on tens of millions of documents per shard, achieving 5 to 10 millisecond p95 ANN latency and supporting hybrid queries with keyword must match filters
Production chunking: Split a 10,000 token document into roughly 29 chunks of 400 tokens with 50 token overlap (the overlap means a stride of 350 tokens between chunk starts), embed each chunk separately, store chunk offsets for snippet extraction