BM25 Implementation: Inverted Index and Dynamic Pruning at Scale
Building a production BM25 system starts with the inverted index: a dictionary mapping terms to postings lists, where each posting contains a document ID and term frequency, optionally with positions for phrase queries. For 100 million documents averaging 200 tokens each, you have roughly 20 billion term occurrences. Without positions, integer compression (variable-byte or block-based schemes like PForDelta) shrinks postings to 1 to 3 bytes per occurrence, yielding 20 to 60 GB for the full corpus. With positions for phrase support, expect 3 to 10 bytes per occurrence, reaching 60 to 200 GB. This fits comfortably in RAM on modern servers (256 to 512 GB), enabling sub-10ms retrieval.
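A minimal sketch of this structure, assuming a toy in-memory Python index with variable-byte compression of doc-ID gaps; `InvertedIndex` and `vbyte_encode` are illustrative names, not a production implementation:

```python
# A toy in-memory inverted index with variable-byte (varint) compression of
# doc-ID gaps and term frequencies; illustrative only, not a production index.
from collections import Counter, defaultdict

def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 bits per byte, with the
    high bit set on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        chunks = []
        while True:
            chunks.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunks[0] |= 0x80              # mark the terminating (least-significant) byte
        out.extend(reversed(chunks))   # emit most-significant byte first
    return bytes(out)

class InvertedIndex:
    def __init__(self):
        # term -> postings list of (doc_id, term_frequency); documents are
        # added in increasing doc_id order, so each list stays sorted.
        self.postings = defaultdict(list)

    def add_document(self, doc_id, tokens):
        for term, tf in Counter(tokens).items():
            self.postings[term].append((doc_id, tf))

    def compressed_postings(self, term):
        """Encode doc-ID *gaps* interleaved with term frequencies as one varint stream."""
        stream, prev = [], 0
        for doc_id, tf in self.postings[term]:
            stream.extend((doc_id - prev, tf))
            prev = doc_id
        return vbyte_encode(stream)

idx = InvertedIndex()
idx.add_document(1, "bm25 ranks documents by term frequency".split())
idx.add_document(7, "bm25 uses idf and term frequency saturation".split())
print(len(idx.compressed_postings("bm25")))   # a handful of bytes for two postings
```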
Dynamic pruning is essential for low latency. WAND (Weak AND) maintains a threshold (the kth-highest score seen so far) and, for each query term, an upper bound on the contribution that term can make to any remaining document. During traversal, if the sum of upper bounds for the terms a candidate can match cannot beat the threshold, the candidate is skipped without full scoring and the cursors advance past whole stretches of the postings lists. Block-Max WAND extends this by precomputing, per block (typically 128 to 256 documents), the maximum score contribution and storing it alongside the block pointer. This allows skipping not just individual documents but entire blocks, reducing the number of score evaluations from millions to thousands.
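A toy sketch of the block-max pruning test under simplifying assumptions: each term's per-document BM25 contribution is precomputed, and the real cursor machinery that jumps over whole blocks is omitted. `TermPostings` and `top_k` are illustrative names:

```python
# Document-at-a-time scoring with a block-max pruning test: a candidate is
# fully scored only if the sum of per-term block maxima can beat the current
# k-th best score. Simplified sketch, not a full Block-Max WAND implementation.
import heapq
from bisect import bisect_left

class TermPostings:
    """One query term's postings, split into fixed-size blocks that carry a
    precomputed maximum score contribution (the 'block max')."""
    def __init__(self, postings, block_size=128):
        # postings: sorted [(doc_id, bm25_contribution)] for this term
        self.doc_ids = [d for d, _ in postings]
        self.contrib = {d: s for d, s in postings}
        self.block_size = block_size
        self.block_max = [
            max(s for _, s in postings[i:i + block_size])
            for i in range(0, len(postings), block_size)
        ]

    def upper_bound(self, doc_id):
        # Block max of the block that contains (or would contain) doc_id;
        # never less than the term's true contribution, so pruning stays safe.
        i = bisect_left(self.doc_ids, doc_id)
        if i == len(self.doc_ids):
            return 0.0
        return self.block_max[i // self.block_size]

    def contribution(self, doc_id):
        return self.contrib.get(doc_id, 0.0)

def top_k(terms, k=10):
    """Return the top-k (score, doc_id) pairs plus the number of full evaluations."""
    candidates = sorted(set().union(*(t.doc_ids for t in terms)))
    heap, evaluated = [], 0                     # min-heap of (score, doc_id)
    for doc in candidates:
        threshold = heap[0][0] if len(heap) == k else 0.0
        if sum(t.upper_bound(doc) for t in terms) <= threshold:
            continue                            # pruned: never fully scored
        evaluated += 1
        score = sum(t.contribution(doc) for t in terms)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), evaluated
```

The key property is that `upper_bound` never underestimates a term's true contribution, so any pruned document could not have entered the top-k heap anyway.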
Sharding and replication balance load and availability. A common pattern: 100 million documents split across 20 shards of 5 million each, replicated 3× for 60 total index instances. Each query fans out to all 20 shards in parallel (one replica per shard, chosen by a load balancer), retrieves the top 50 candidates per shard in 8 to 15ms, and a central aggregator then merges the 1,000 candidates (20 shards × 50) in under 2ms. Tail latency (p99) is determined by the slowest shard; hedging (sending a duplicate request to another replica after 30ms) or other speculative execution mitigates stragglers.
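A rough scatter-gather sketch with per-shard hedging, assuming a hypothetical async RPC `search_replica(shard, replica, query, k)` that returns a list of (score, doc_id) tuples; the 30ms hedge delay mirrors the figure above:

```python
# Fan out to all shards in parallel, hedge each slow shard to a second
# replica, then merge the per-shard top lists. `search_replica` is assumed,
# not a real client library.
import asyncio
import heapq

HEDGE_DELAY = 0.030  # send a duplicate request if the primary hasn't answered in 30 ms

async def query_shard(shard, query, k, search_replica):
    primary = asyncio.create_task(search_replica(shard, replica=0, query=query, k=k))
    done, _ = await asyncio.wait({primary}, timeout=HEDGE_DELAY)
    if done:
        return primary.result()
    # Primary is a straggler: hedge to another replica, take whichever answers first.
    hedge = asyncio.create_task(search_replica(shard, replica=1, query=query, k=k))
    done, pending = await asyncio.wait({primary, hedge}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return next(iter(done)).result()

async def search(query, shards, k, per_shard_k, search_replica):
    # Scatter to every shard in parallel, then gather and merge by score.
    per_shard = await asyncio.gather(
        *(query_shard(shard, query, per_shard_k, search_replica) for shard in shards)
    )
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))
```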
Keeping indexes fresh requires incremental updates. An append-only segment architecture (used by Lucene and Elasticsearch) writes new documents to in-memory buffers that are flushed to immutable segments every few seconds, with background merges consolidating small segments. This keeps write latency low (1 to 5ms per document) while maintaining read consistency. IDF statistics lag by design: recompute global stats every hour or day and push them atomically to all shards to maintain ranking consistency. For real-time applications requiring sub-second freshness, maintain a small hot index (the last hour's documents) searched alongside the main index, accepting slightly higher latency (2 to 5ms overhead).
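A compact sketch of the append-only segment pattern, assuming a simple flush-on-interval policy; background merging and the periodic global IDF push are omitted, and all names are illustrative:

```python
# Writes land in a mutable in-memory buffer, which is flushed to an immutable
# segment every few seconds; readers see flushed segments plus the live buffer.
import time

class Segment:
    """Immutable once flushed: term -> [(doc_id, term_frequency)]."""
    def __init__(self, postings):
        self.postings = postings

class SegmentedIndex:
    def __init__(self, flush_interval=2.0):
        self.segments = []          # immutable segments, searched read-only
        self.buffer = {}            # hot buffer: term -> [(doc_id, tf)]
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()

    def add_document(self, doc_id, term_freqs):
        # Cheap append (the low-latency write path); flush on an interval.
        for term, tf in term_freqs.items():
            self.buffer.setdefault(term, []).append((doc_id, tf))
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.segments.append(Segment(self.buffer))
            self.buffer = {}
        self.last_flush = time.monotonic()

    def postings(self, term):
        # A query reads every flushed segment plus the not-yet-flushed buffer.
        for segment in self.segments:
            yield from segment.postings.get(term, [])
        yield from self.buffer.get(term, [])
```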
💡 Key Takeaways
•Block-Max WAND reduces score evaluations by 100× to 1000×: naive scoring of 10M documents takes 500ms; WAND evaluates ~50K candidates in 12ms; Block-Max WAND evaluates ~5K candidates in 5ms by skipping entire 256-document blocks using precomputed max scores
•Memory vs disk tradeoff: in-memory postings give 5 to 10ms p95 retrieval but cost $200 to $400 per server (512 GB RAM); disk-based with SSD gives 15 to 30ms p95 with 10× cheaper hardware ($40 per server); a hybrid (hot terms in RAM, cold on SSD) balances cost and latency
•Compression impact: uncompressed postings with 4-byte docIDs and 4-byte frequencies use 8 bytes per posting (160 GB for 20B postings); variable-byte encoding reduces this to 2 to 3 bytes (40 to 60 GB); block-based PForDelta achieves 1.5 to 2 bytes (30 to 40 GB) at the cost of 10 to 20 percent decompression overhead (a quick arithmetic check of these figures appears after this list)
•Freshness vs consistency: appending documents to the index is cheap (1ms per doc) but changes IDF immediately on that shard only, causing ranking divergence across replicas; delaying IDF updates by 1 hour keeps consistency but means new trending terms rank poorly for an hour
•Hedging reduces p99 latency by 30 to 50 percent: if shard replica A doesn't respond within 30ms, send a duplicate to replica B; 95 percent of queries complete with the primary alone, 4 percent need the hedge, 1 percent need both; the cost is 5 percent extra load, but it cuts p99 from 80ms to 45ms
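A quick arithmetic check of the compression bullet above, assuming 20 billion postings and midpoint bytes-per-posting values:

```python
# Pure arithmetic; bytes-per-posting values are midpoints of the quoted ranges.
POSTINGS = 20e9
schemes = [
    ("uncompressed (4-byte docID + 4-byte tf)", 8.0),
    ("variable-byte encoding", 2.5),
    ("block-based PForDelta", 1.75),
]
for name, bytes_per_posting in schemes:
    print(f"{name}: {POSTINGS * bytes_per_posting / 1e9:.0f} GB")
# -> 160 GB, 50 GB, 35 GB respectively
```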
📌 Examples
Elasticsearch at 100M documents: 20 shards × 3 replicas on 20 physical nodes (each hosts 3 shards from different logical partitions); 64 GB RAM per node; postings compressed with LZ4 to 35 GB per shard; query p95 of 12ms, p99 of 28ms at 5,000 QPS per cluster
Custom inverted index in C++ for latency-critical RAG: 10M passages in memory using PForDelta compression (15 GB postings); a single-threaded WAND search for a 10-term query evaluates 8K candidates and retrieves the top 200 in 4ms p95 on a single core; scales to 50K QPS on a 24-core server
Real-time Twitter search: the main index covers tweets older than 5 minutes (billions of tweets, 500ms p95); a hot index covers the last 5 minutes (10M tweets, 50ms p95); a query hits both in parallel and merges the results; users see tweets within 10 seconds of posting while the system maintains low latency
Wikipedia search index (6M articles, 3.5B tokens): a single shard fits in 48 GB RAM (compressed postings 28 GB, dictionaries 4 GB, doc metadata 12 GB, OS cache 4 GB); WAND with k=100 evaluates 2K to 15K candidates depending on query specificity; p95 latency 8ms, p99 18ms on a commodity CPU