
Trade-offs: Freshness, Recall, Latency, and Cost

Every index management decision trades off four constraints: freshness (how quickly new items become searchable), recall at K (what fraction of the true top-K neighbors you return), latency (p50 and p99 query time), and cost (CPU, memory, storage). Optimizing one degrades the others.

Freshness versus throughput is the classic trade-off. Refreshing the index every 1 second makes new items searchable almost immediately, ideal for real-time inventory or trending content. But frequent refreshes create many small segments that must be merged, increasing write amplification by 3x to 5x and spiking p99 latencies during merges. Batching updates into 30-second or 5-minute windows improves throughput from 2,000 to 10,000 writes per second per shard but increases staleness. Elasticsearch users targeting sub-second freshness often provision 2x to 3x more CPU and memory to absorb merge load.

Latency versus recall is tuned via probing parameters. For Inverted File (IVF) indexes, probing 10 lists gives 90 percent recall at 10 milliseconds per query; probing 50 lists reaches 98 percent recall but takes 35 milliseconds. For Hierarchical Navigable Small World (HNSW) graphs, exploring 32 neighbors yields 92 percent recall at 12 milliseconds, while 128 neighbors achieves 99 percent recall at 45 milliseconds. The marginal cost of the last 5 percent of recall often doubles latency. Meta and Google tune for 95 to 97 percent recall in production, accepting that 3 to 5 of the true top 100 items are missed to stay within latency budgets.

Memory versus accuracy is controlled by quantization. Product Quantization at 16 bytes per vector reduces memory by 15x versus float32 but introduces distance errors that drop recall by 1 to 3 percent. Residual quantization adds 8 bytes per vector, improving recall by 2 percent but increasing memory cost by 50 percent. At scale this is significant: 500 million vectors at 24 bytes is 12 gigabytes versus 8 gigabytes at 16 bytes, requiring 50 percent more nodes.
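The latency-versus-recall knob is easy to measure yourself. The following is a minimal sketch using FAISS on synthetic data; the dataset size, dimensionality, number of coarse lists, and the 16-byte PQ setting are illustrative assumptions rather than the production configurations quoted above, so the absolute recall and latency numbers will differ.

```python
import time
import numpy as np
import faiss  # pip install faiss-cpu

d, nb, nq, k = 128, 100_000, 1_000, 10           # toy sizes (assumptions)
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)       # database vectors
xq = rng.random((nq, d), dtype=np.float32)       # query vectors

# Exact search provides the ground-truth top-k used to measure recall.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

def recall_at_k(pred, truth):
    # Fraction of true top-k ids that also appear in the returned top-k.
    return np.mean([len(set(p) & set(t)) / k for p, t in zip(pred, truth)])

# IVF-PQ: 256 coarse lists, 16-byte codes (16 sub-quantizers x 8 bits each).
nlist, m = 256, 16
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
ivfpq.train(xb)
ivfpq.add(xb)

for nprobe in (1, 10, 50):                       # the IVF latency/recall knob
    ivfpq.nprobe = nprobe
    t0 = time.perf_counter()
    _, pred = ivfpq.search(xq, k)
    ms = (time.perf_counter() - t0) / nq * 1000
    print(f"IVF-PQ nprobe={nprobe:3d}  recall@{k}={recall_at_k(pred, gt):.3f}  {ms:.2f} ms/query")

# HNSW: efSearch plays the same role as nprobe does for IVF.
hnsw = faiss.IndexHNSWFlat(d, 32)                # 32 graph links per node
hnsw.add(xb)
for ef in (32, 128):                             # the HNSW latency/recall knob
    hnsw.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, pred = hnsw.search(xq, k)
    ms = (time.perf_counter() - t0) / nq * 1000
    print(f"HNSW   efSearch={ef:3d}  recall@{k}={recall_at_k(pred, gt):.3f}  {ms:.2f} ms/query")
```

Sweeping nprobe or efSearch this way and plotting recall against per-query latency is how the 95 to 97 percent "sweet spot" above is typically chosen for a given corpus.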
💡 Key Takeaways
Freshness under 1 second requires 2x to 3x the compute. Frequent segment writes and merges spike CPU and IO. Elasticsearch users report that refresh intervals below 5 seconds cause constant merge activity, pushing CPU utilization from 40 percent to 80 percent.
Recall and latency follow a power-law curve. The first 90 percent of recall is cheap (10 milliseconds), but reaching 99 percent costs roughly 4x the latency (40 milliseconds). Google and Meta target 95 to 97 percent recall as the sweet spot.
Quantization saves memory but costs accuracy. Product Quantization at 8 bytes per vector achieves 93 percent recall, 16 bytes reaches 96 percent, and 32 bytes hits 98 percent. Each doubling of bytes gains 2 to 3 percent recall.
Shard count trades efficiency for isolation. 32 shards provide better failure isolation and write distribution than 8 shards, but coordination overhead increases from 5 milliseconds to 15 milliseconds due to more network round trips.
Global versus local indexes trade consistency for simplicity. A global index spanning shards returns perfect top K but is expensive to maintain under writes. Local indexes per shard are simpler and faster to update but require fanout and approximate merging.
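To make the local-index path concrete, here is a minimal scatter-gather sketch. It assumes each shard exposes a FAISS-style search(query, k) returning distances and ids that are already globally unique; shard routing, timeouts, and retries are omitted, and the function name fanout_search is just for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def fanout_search(shard_indexes, query, k):
    """Query every shard's local ANN index, then merge the per-shard top-k lists.

    The merged result is only as good as each shard's local answer: if a shard's
    approximate index misses a true neighbor, the global top-k misses it too,
    which is the recall cost of local indexes described above.
    """
    def search_one(shard):
        # Local top-k from one shard (FAISS-style API assumed).
        dists, ids = shard.search(query.reshape(1, -1), k)
        return list(zip(dists[0].tolist(), ids[0].tolist()))

    # Scatter: query all shards in parallel.
    with ThreadPoolExecutor(max_workers=len(shard_indexes)) as pool:
        partial_results = list(pool.map(search_one, shard_indexes))

    # Gather: keep the k smallest distances across all shards' candidates.
    candidates = (hit for shard_hits in partial_results for hit in shard_hits)
    return heapq.nsmallest(k, candidates, key=lambda hit: hit[0])

# Illustrative shard setup with global ids, e.g.:
#   shard = faiss.IndexIDMap(faiss.IndexFlatL2(d))
#   shard.add_with_ids(shard_vectors, global_ids)
```

The coordination overhead mentioned above comes from exactly this fan-out: more shards means more parallel round trips to wait on, and the slowest shard sets the p99.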
📌 Examples
Pinterest uses 30-second refresh intervals for pin search, balancing sub-minute freshness with manageable merge load. This supports 5,000 writes per second per shard at 25 millisecond p99 latency.
Meta FAISS deployments tune Inverted File probing to 32 lists out of 4,096, achieving 96 percent recall at 20 millisecond p99. Probing 64 lists would reach 98 percent recall but double latency to 40 milliseconds.
Spotify reported memory constraints with full precision embeddings. Switching to 16 byte Product Quantization reduced memory from 60 gigabytes to 5 gigabytes per 100 million vectors, with recall dropping from 98 percent to 96 percent.
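The memory arithmetic behind figures like these is easy to reproduce. The sketch below assumes 150-dimensional float32 embeddings (chosen so the full-precision figure lands near 60 gigabytes) plus an 8-byte id per vector; the helper name index_memory_gb is hypothetical. Real indexes add further overhead for inverted lists, graph links, and metadata, which is why reported footprints exceed the raw code size.

```python
def index_memory_gb(num_vectors, dim, code_bytes=None, id_bytes=8):
    """Back-of-envelope footprint: stored vectors (or compressed codes) plus one id each.

    code_bytes=None means full-precision float32 storage (4 bytes per dimension);
    otherwise each vector is stored only as a compressed code of that size.
    """
    per_vector = (dim * 4 if code_bytes is None else code_bytes) + id_bytes
    return num_vectors * per_vector / 1e9

n, d = 100_000_000, 150                       # assumed corpus size and dimensionality
print(index_memory_gb(n, d))                  # ~60.8 GB at full precision
print(index_memory_gb(n, d, code_bytes=16))   # ~2.4 GB of raw 16-byte PQ codes plus ids
```

The gap between the roughly 2.4 gigabytes of raw codes and a reported footprint of around 5 gigabytes is plausibly accounted for by per-vector index overhead such as inverted-list or graph structures and id maps.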