
Hot Index Plus Main Index Architecture

The hot plus main index pattern splits your data into a small, fast-moving hot index and a large, stable main index. The hot index holds recent items, typically the last 24 to 72 hours, entirely in memory with aggressive refresh intervals of 1 to 2 seconds. The main index contains tens or hundreds of millions of older items, often stored on Solid State Drives (SSD) and optimized for read throughput and cost efficiency. Queries search both indexes and merge results with deduplication.

This architecture addresses the core tension in real-time vector search. Continuous writes to a large index cause memory fragmentation, frequent graph rewiring, and compaction overhead, all of which degrade query latency. By isolating recent updates in a hot index, you contain write amplification and keep the main index stable. A typical query budget allocates 50 to 80 milliseconds for top-k retrieval across both indexes and 10 to 30 milliseconds for reranking.

The hot index receives all incoming upserts and deletes from the streaming pipeline. It might hold 1 to 10 million items and refresh every 1 to 2 seconds. The main index is rebuilt on a slower cadence, perhaps daily or every few hours, by merging the previous main index with aged-out items from the hot index. At query time, you fetch the top k from hot and the top k from main, merge them with version-based deduplication, and rerank to return the final top k. Scoring functions must be consistent across both indexes, because scores from the two indexes are compared directly during the merge and any mismatch causes ranking inversions.

Pinterest and Spotify use variants of this pattern. Spotify originally used Annoy, which requires periodic full rebuilds, so teams prebuilt indexes offline and swapped them in during low-traffic windows. The hot plus main pattern provides a middle ground: continuous updates for fresh content, while the larger index stays stable and fast.
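The query-time merge described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the `Hit` type, the `merge_hits` function, and the `deleted_ids` tombstone set are hypothetical names, and it assumes each item carries a monotonically increasing version and that both indexes return scores from the same scoring function.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set
import heapq


@dataclass(frozen=True)
class Hit:
    item_id: str
    score: float   # higher is better; assumed comparable across both indexes
    version: int   # assumed to increase on every update of the item


def merge_hits(hot_hits: Iterable[Hit], main_hits: Iterable[Hit],
               deleted_ids: Set[str], k: int) -> List[Hit]:
    """Merge per-index candidates, keeping the newest version of each item."""
    newest = {}
    # Hot hits are visited first, so on a version tie the hot copy wins.
    for hit in list(hot_hits) + list(main_hits):
        current = newest.get(hit.item_id)
        if current is None or hit.version > current.version:
            newest[hit.item_id] = hit
    # Drop items that the hot index has recorded as deleted.
    live = [h for h in newest.values() if h.item_id not in deleted_ids]
    # A real system would pass these candidates to a reranker; sorting by
    # score stands in for that step here.
    return heapq.nlargest(k, live, key=lambda h: h.score)
```

Note that this only deduplicates items that both indexes happened to return. If an item was updated in the hot index and its new vector no longer matches the query, a stale copy can still surface from the main index unless main hits are also checked against the hot index's latest versions.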
💡 Key Takeaways
Hot index holds 1 to 10 million recent items in memory with a 1 to 2 second refresh; main index holds 50 to 500 million items on SSD
Isolates write pressure to the hot index, preventing fragmentation and compaction overhead in the large main index
Query budget is typically 50 to 80 milliseconds for retrieval across both indexes and 10 to 30 milliseconds for reranking
Main index is rebuilt daily or every few hours by merging the previous main index with aged-out hot items, avoiding continuous graph rewiring
Requires version-based deduplication during the merge to handle items present in both indexes, and consistent scoring to avoid rank inversions
Spotify used full rebuilds with Annoy, swapping in prebuilt indexes during low traffic; hot plus main provides continuous freshness without full swaps
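The periodic rebuild can be pictured as a merge of two keyed record sets. The sketch below is an assumption-laden illustration: `Record`, `compact`, and `max_age_s` are hypothetical names, records are plain dictionaries keyed by item id, and a `None` vector stands for a delete. In practice the resulting record set would feed an offline ANN index build that is then swapped in.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Record:
    version: int
    updated_at: float          # seconds since epoch
    vector: Optional[tuple]    # None marks a delete (tombstone)


def compact(main: Dict[str, Record], hot: Dict[str, Record],
            now: float, max_age_s: float
            ) -> Tuple[Dict[str, Record], Dict[str, Record]]:
    """Move aged-out hot records into a new main set; fresh ones stay hot."""
    new_main = dict(main)      # the previous main set is left untouched
    new_hot = {}
    for item_id, rec in hot.items():
        if now - rec.updated_at < max_age_s:
            new_hot[item_id] = rec           # still fresh, stays in hot
            continue
        existing = new_main.get(item_id)
        if existing is not None and existing.version > rec.version:
            continue                         # main already holds a newer copy
        if rec.vector is None:
            new_main.pop(item_id, None)      # apply the delete
        else:
            new_main[item_id] = rec          # insert or overwrite
    return new_main, new_hot
```

Because the previous main set is copied rather than mutated, the old index can keep serving queries until the new one is built and swapped in.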
📌 Examples
Pinterest home feed uses a streaming feature pipeline with hot features in memory and the main index on SSD, incorporating engagement within seconds
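Typical setup: hot index with 5 million items at 768 dimensions uses about 15 GB of RAM for the raw float32 vectors (before index overhead), main index with 200 million items on NVMe SSD
Query flow: retrieve top 100 from hot (40 ms) and top 100 from main (50 ms), deduplicate to 150 unique, rerank to final top 50 (20 ms); total is 110 ms if the two retrievals run one after the other, or about 70 ms if they run in parallel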
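The figures in the last two examples can be checked with some quick arithmetic. The helper names below are hypothetical, and the memory estimate assumes uncompressed float32 vectors (4 bytes per value) with no graph or metadata overhead.

```python
def raw_vector_bytes(num_items: int, dims: int, bytes_per_value: int = 4) -> int:
    """Size of the raw vectors only, assuming float32 by default."""
    return num_items * dims * bytes_per_value


def total_latency_ms(hot_ms: float, main_ms: float, rerank_ms: float,
                     parallel: bool) -> float:
    """End-to-end latency for one query, ignoring network and merge cost."""
    retrieval = max(hot_ms, main_ms) if parallel else hot_ms + main_ms
    return retrieval + rerank_ms


# Hot index: 5 million vectors at 768 dimensions, roughly 15 GB.
hot_gb = raw_vector_bytes(5_000_000, 768) / 1e9

# Main index: 200 million vectors at 768 dimensions, roughly 614 GB,
# which is why it is kept on SSD rather than in RAM.
main_gb = raw_vector_bytes(200_000_000, 768) / 1e9

sequential_ms = total_latency_ms(40, 50, 20, parallel=False)  # 110
parallel_ms = total_latency_ms(40, 50, 20, parallel=True)     # 70
```

Only the parallel fan-out keeps retrieval (50 ms) inside the 50 to 80 millisecond retrieval budget quoted earlier; run sequentially, retrieval alone takes 90 ms.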