BM25 Implementation: Inverted Index and Dynamic Pruning at Scale
From Formula to Production
BM25 is just a scoring formula. Making it work at scale over 100 million documents requires an inverted index, compression, and pruning algorithms that avoid scanning everything. The gap between textbook BM25 and production BM25 is where the engineering complexity lives.
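For reference, the scoring formula itself is small. Below is a minimal sketch of one term's BM25 contribution to one document, using the common defaults k1 = 1.2 and b = 0.75 (parameter values assumed, not taken from this text):

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """One term's BM25 contribution to one document.

    tf: term frequency in the document, df: document frequency of the term,
    num_docs: collection size. Uses the smoothed IDF common in Lucene-style
    implementations; exact variants differ across systems.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # TF saturation: repeated occurrences add less and less, and longer
    # documents are penalized via the length normalization term b.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

A document's full score is the sum of this value over all query terms. Everything in the rest of this article is about computing that sum over billions of postings without touching most of them.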
The Inverted Index Foundation
The index maps each term to a postings list of (document ID, term frequency) pairs. For 100 million documents averaging 200 tokens each, that is roughly 20 billion term occurrences. Without positional data, integer compression (variable byte or PForDelta) shrinks each posting to 1 to 2 bytes, yielding 20 to 40 GB for the core index. That fits comfortably in RAM on servers with 128 to 256 GB of memory. The term dictionary (1 to 5 GB for 10 million unique terms) stays in RAM along with frequently accessed postings.
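Delta-plus-variable-byte is the simpler of the two compression schemes mentioned above, and it shows where the 1 to 2 bytes per posting comes from: sorted doc IDs become small gaps, and small gaps fit in one byte. A sketch (one common byte layout, with the high bit marking the final byte of each integer):

```python
def vbyte_encode(numbers):
    """Variable-byte encode: 7 data bits per byte, low-order bytes first;
    the high bit (0x80) marks the last byte of each integer."""
    out = bytearray()
    for n in numbers:
        while True:
            if n < 128:
                out.append(n | 0x80)  # final byte: set the stop bit
                break
            out.append(n & 0x7F)
            n >>= 7
    return bytes(out)

def vbyte_decode(data):
    out, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:            # stop bit: integer is complete
            out.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return out

def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte compress the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)
```

A term appearing in one document out of every few hundred has typical gaps well under 16,384, so most postings take one or two bytes. PForDelta packs blocks of gaps at a fixed bit width with exceptions patched separately, trading this byte-aligned simplicity for better decode throughput.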
Query Execution at Scale
For the query "cheap pizza near me", first look up each term in the dictionary to retrieve its postings list. A common term like "cheap" might return 2 million document IDs; "pizza" returns 500,000. Intersect the lists, score each candidate with BM25, and maintain a min heap of the top K scores (typically K = 1000 for the first retrieval stage). A naive implementation that scans all 2 million postings takes 200 to 400 ms; production systems finish in 15 to 25 ms through aggressive pruning.
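The intersect-score-heap pipeline can be sketched as follows. This simplification represents each term's postings as an in-memory dict of {doc_id: tf} rather than a compressed list, and `score_fn` is a hypothetical per-term BM25 scorer; the point is the heap discipline, where `heap[0]` always holds the K-th best score seen so far:

```python
import heapq

def top_k_intersection(postings, score_fn, k=1000):
    """Intersect per-term postings dicts {doc_id: tf}, score matches, keep top k.

    postings: one dict per query term. score_fn(term_idx, tf, doc_id) returns
    that term's BM25 contribution (signature is illustrative).
    """
    # Drive iteration from the rarest term: fewest candidates to probe.
    rarest = min(range(len(postings)), key=lambda i: len(postings[i]))
    heap = []  # min heap of (score, doc_id); heap[0] is the K-th best so far
    for doc_id in postings[rarest]:
        if all(doc_id in p for p in postings):
            score = sum(score_fn(i, p[doc_id], doc_id)
                        for i, p in enumerate(postings))
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # highest score first
```

Driving from the rarest term already prunes a lot ("pizza" at 500,000 beats "cheap" at 2 million), but every surviving candidate is still fully scored; the WAND-style pruning below avoids even that.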
Dynamic Pruning with WAND
The key insight: you want the top K results, not all results. Track the K-th best score as a threshold (initially 0, quickly rising to perhaps 8.5 after the first few hundred documents). For each candidate, compute the maximum possible score assuming every query term matches at its maximum TF. If that upper bound (say 7.2) is below the threshold (8.5), skip the document without exact scoring; of the 2 million postings, only about 50,000 end up being fully scored. WAND (Weak AND) processes postings in document ID order, maintaining a pivot that lets term cursors skip ahead over entire ranges of documents. BlockMax WAND additionally stores per-block maximum scores, enabling 90 to 95 percent of postings to be skipped.
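The upper-bound test at the heart of WAND can be illustrated compactly. This sketch is deliberately simplified: it materializes per-document term matches up front instead of advancing sorted cursors and picking a pivot term as real WAND does, so it shows the pruning decision, not the traversal. `max_scores[t]` is the precomputed ceiling on any document's score for term t:

```python
import heapq
from collections import defaultdict

def wand_candidates(term_postings, max_scores, score_fn, k):
    """WAND-style pruning over doc-ID-sorted postings (simplified).

    term_postings: list of sorted (doc_id, tf) lists, one per query term.
    max_scores: per-term upper bound on any single document's score.
    Skips exact scoring whenever the upper bound cannot beat the
    current top-k threshold. Returns (top_k, scored, skipped).
    """
    matches = defaultdict(dict)
    for t, plist in enumerate(term_postings):
        for doc_id, tf in plist:
            matches[doc_id][t] = tf
    heap, scored, skipped = [], 0, 0
    for doc_id in sorted(matches):           # document ID order, as in WAND
        terms = matches[doc_id]
        upper = sum(max_scores[t] for t in terms)
        threshold = heap[0][0] if len(heap) == k else 0.0
        if len(heap) == k and upper <= threshold:
            skipped += 1
            continue                          # cannot enter top-k: no scoring
        scored += 1
        s = sum(score_fn(t, tf) for t, tf in terms.items())
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True), scored, skipped
```

BlockMax WAND refines `max_scores` from one global ceiling per term to a ceiling per compressed block, so the same test fires far more often and whole blocks are skipped without decompression.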
Memory Hierarchy and Sharding
Cold postings live on NVMe SSDs with block compression; a 4 KB block read takes 50 to 100 microseconds. If the index exceeds RAM, shard it across machines. Elasticsearch (before version 7) defaulted to 5 primary shards queried in parallel, with results merged by a coordinator node. Target: 80 percent of p99 latency should be CPU-bound scoring, not IO waiting. For a 100 million document index, expect 15 to 30 ms p50 latency with proper tuning.
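The coordinator's merge step is cheap because each shard already returns its own sorted top-K; a sketch of that scatter-gather merge (not Elasticsearch's actual code, just the shape of the operation, since BM25 scores from independent shards are directly comparable when IDF statistics are shared or close):

```python
import heapq

def merge_shard_results(shard_results, k):
    """Coordinator-side merge: each shard contributes its own top-k list of
    (score, doc_id) tuples sorted descending; produce the global top-k."""
    merged = heapq.merge(*shard_results, reverse=True)  # streams stay sorted
    return [hit for _, hit in zip(range(k), merged)]
```

Because the merge only touches K entries per shard (5,000 tuples for 5 shards at K = 1000), coordination overhead is microseconds; the latency budget stays in the per-shard scoring, which is exactly the 80 percent CPU-bound target above.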