
Index Building: Batch Construction vs Incremental Updates

Index building determines how you materialize search structures. The fundamental choice is between batch building from scratch and incremental updates as new items arrive.

Batch building processes the full corpus offline; for a 500-million-item product catalog, this might run nightly on a Spark cluster. You train quantizers on sampled data (100,000 vectors is often sufficient for k-means clustering), build the complete index structure, snapshot it to object storage, and swap it in via an alias in seconds. This produces compact, optimal structures. Google's ScaNN and Meta's FAISS both support bulk builds that leverage GPU acceleration for training and construction, completing billion-vector indexes in hours. The tradeoff is staleness: items added during the day won't be searchable until the next build completes.

Incremental building appends new items to a delta structure. Inverted indexes write small immutable segments, similar to Lucene or Elasticsearch. Vector indexes maintain an overlay graph or bucket new vectors into coarse partitions without full retraining. Spotify's approach with Annoy favored rebuilds because the tree structure is hard to update incrementally, but Hierarchical Navigable Small World (HNSW) graphs support inserts by adding nodes and relinking neighbors in milliseconds per vector.

Production systems combine both. A main index is rebuilt daily or weekly for optimal compression and structure, while a delta index handles real-time updates at 1,000 to 5,000 writes per second; queries merge results from both. Pinterest reports using this pattern: a nightly batch build for the corpus plus a streaming delta for fresh pins, achieving sub-minute freshness with 98 percent recall.
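To make the batch path concrete, here is a minimal sketch of an offline IVF-PQ build with FAISS: train the quantizers on a small sample, bulk-add the full corpus, and snapshot the result for an alias swap. The file names, dimensionality, and index parameters are illustrative assumptions, not details from any of the systems named above.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 128  # embedding dimensionality (illustrative)
corpus = np.load("embeddings.npy").astype("float32")  # hypothetical full corpus, shape (N, d)

# Train the coarse quantizer and PQ codebooks on a sample; ~100k vectors is
# usually enough for the k-means training step described above.
sample_ids = np.random.choice(len(corpus), size=100_000, replace=False)
sample = corpus[sample_ids]

nlist, m, nbits = 1024, 16, 8  # IVF cells, PQ subquantizers, bits per subcode
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(sample)   # the expensive step, amortized across the whole batch build
index.add(corpus)     # bulk construction after training

# Snapshot the built index; a serving layer can then switch an alias to this file.
faiss.write_index(index, "catalog-index-new.faiss")
```

Writing the index under a new name keeps the cutover to production an atomic pointer (or alias) change, which is the same blue-green pattern described in the takeaways below.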
💡 Key Takeaways
Batch builds are optimal for structure and compression. A nightly rebuild on 500 million vectors with Product Quantization yields 10 to 20 bytes per vector and 98 percent recall, versus 30 to 40 bytes and 95 percent recall with incremental updates.
Incremental updates maintain freshness. A delta index handling 5,000 writes per second makes new items searchable in under 60 seconds, critical for real-time inventory or trending content (see the merge sketch after this list).
Training overhead is significant. Training Product Quantization codebooks or building Hierarchical Navigable Small World graphs on 100 million vectors can take 2 to 6 hours on GPU clusters. This is amortized in batch builds but prohibitive for per-item updates.
Write amplification increases with incremental updates. Each update creates a small segment that must be merged. Elasticsearch users report 3x to 5x write amplification when refresh intervals drop below 5 seconds.
Blue-green deployments avoid downtime. Build the new index in parallel, double-write updates to both old and new during cutover, then switch an alias atomically (sketched after this list). Rollback takes seconds if the new index has issues.
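Here is a minimal sketch of the query-time merge mentioned above, assuming both the main and delta indexes expose a FAISS-style search() returning (distances, ids); the two-index layout and the tie-breaking rule are illustrative assumptions.

```python
import numpy as np

def merged_search(main_index, delta_index, query: np.ndarray, k: int = 10):
    """Query both the nightly main index and the streaming delta index,
    then merge candidates by distance (smaller is better for L2)."""
    d_main, i_main = main_index.search(query, k)    # query: float32 array of shape (1, d)
    d_delta, i_delta = delta_index.search(query, k)

    best = {}
    for ids, dists in ((i_main[0], d_main[0]), (i_delta[0], d_delta[0])):
        for idx, dist in zip(ids, dists):
            if idx == -1:                 # FAISS pads empty result slots with -1
                continue
            if idx not in best or dist < best[idx]:
                best[idx] = float(dist)

    # Return the global top-k across both indexes as (id, distance) pairs.
    return sorted(best.items(), key=lambda kv: kv[1])[:k]
```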
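And a sketch of the blue-green cutover itself, using the Elasticsearch alias REST API as one concrete example; the index names, alias name, and cluster URL are hypothetical, and other engines expose an equivalent atomic swap.

```python
import requests

# Assumes the ingestion pipeline is already double-writing to both
# products_v1 (old) and products_v2 (new) during the cutover window.

# Atomically repoint the serving alias from the old index to the new one.
resp = requests.post(
    "http://localhost:9200/_aliases",
    json={
        "actions": [
            {"remove": {"index": "products_v1", "alias": "products_live"}},
            {"add":    {"index": "products_v2", "alias": "products_live"}},
        ]
    },
)
resp.raise_for_status()
# Rollback is the same call with the two indexes swapped, taking seconds.
```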
📌 Examples
Meta rebuilds FAISS indexes nightly for Facebook Search, training quantizers on sampled data and swapping via alias. Delta updates go to a small HNSW overlay that merges at query time.
Google uses ScaNN with daily batch builds for YouTube recommendations, achieving 2x to 10x CPU speedup over prior methods while maintaining 99 percent recall at k = 100.
Elasticsearch clusters for log search rebuild daily indexes at midnight, while streaming logs write to a current day index with 30 second refresh, balancing freshness and merge load.