Embeddings & Similarity Search • Real-time Updates (Incremental Indexing)
Operational Metrics and Failure Detection
Monitoring real-time incremental indexing requires tracking both write-path health and read-path quality. Indexing lag measures the time from event emission to queryable state, typically targeting a p95 under 2 to 5 seconds. Queue depth shows backlog size; spikes indicate write storms or slow indexers. Per-shard write throughput should stay within capacity, often 1,000 to 5,000 upserts per second. Refresh latency tracks how long it takes for new segments or graph updates to become visible to queries, targeting 1 to 2 seconds for hot indexes.
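As a rough illustration, the sketch below shows how indexing lag and write throughput might be sampled in process before being exported to whatever monitoring system you use. The class and method names are hypothetical and not tied to any particular metrics library.

```python
import time
from dataclasses import dataclass, field

@dataclass
class WritePathMetrics:
    """Hypothetical in-process collector for write-path health signals."""
    lag_samples_s: list = field(default_factory=list)   # event emission -> queryable, in seconds
    upserts_in_window: int = 0
    window_start: float = field(default_factory=time.time)

    def record_indexed(self, event_emit_ts: float) -> None:
        # Indexing lag: time from event emission to the moment it became queryable.
        self.lag_samples_s.append(time.time() - event_emit_ts)
        self.upserts_in_window += 1

    def p95_lag_s(self) -> float:
        # Nearest-rank p95 over the current window; alert if this exceeds 2-5 s.
        if not self.lag_samples_s:
            return 0.0
        ordered = sorted(self.lag_samples_s)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def upserts_per_second(self) -> float:
        # Per-shard write throughput; compare against the 1,000-5,000/s capacity band.
        elapsed = max(time.time() - self.window_start, 1e-9)
        return self.upserts_in_window / elapsed
```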
For vector indexes, track memory per million vectors as a leading indicator of cost and capacity. Typical values range from 1 to 4 GB per million vectors at 256 to 768 dimensions, depending on graph degree and quantization. Monitor the tombstone count and ratio; if the ratio exceeds 10 to 20 percent, schedule compaction to prevent recall degradation. Graph rewiring rate during insertions affects write throughput; sudden spikes indicate hotspots or algorithmic issues.
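A minimal sketch of the tombstone check, assuming you can read live and deleted counts from your index; the 0.15 default simply sits inside the 10 to 20 percent band above and should be tuned per workload.

```python
def needs_compaction(live_vectors: int, tombstones: int,
                     tombstone_ratio_threshold: float = 0.15) -> bool:
    """Return True when deleted (tombstoned) entries make up enough of the
    index that recall and memory footprint are likely degrading."""
    total = live_vectors + tombstones
    if total == 0:
        return False
    return tombstones / total >= tombstone_ratio_threshold

# Example: 850k live vectors, 150k tombstones -> 15% ratio, schedule compaction.
assert needs_compaction(850_000, 150_000)
```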
Query-side metrics include p95 and p99 latency, broken down by hot index and main index. Correlate latency spikes with compaction or maintenance tasks to identify interference. Track recall or precision at k if you have ground truth or can sample it. For production systems, also monitor business metrics such as click-through rate, null rate (the fraction of queries returning no results), and user engagement. A sudden drop in click-through rate of 5 percent or more often signals index corruption, stale embeddings, or training-serving skew.
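The read-path quality check might look like the following sketch; the function name, thresholds, and metric sources are illustrative assumptions rather than a standard API.

```python
def detect_quality_regression(ctr_baseline: float, ctr_current: float,
                              null_rate_current: float,
                              ctr_drop_threshold: float = 0.05,
                              null_rate_threshold: float = 0.02) -> list:
    """Flag relative click-through-rate drops of 5%+ and elevated null rates."""
    alerts = []
    if ctr_baseline > 0 and (ctr_baseline - ctr_current) / ctr_baseline >= ctr_drop_threshold:
        alerts.append("click-through rate dropped >= 5% vs baseline")
    if null_rate_current >= null_rate_threshold:
        alerts.append("null rate above threshold")
    return alerts

# Example: CTR fell from 0.12 to 0.11 (~8% relative drop) -> investigate the index.
print(detect_quality_regression(ctr_baseline=0.12, ctr_current=0.11, null_rate_current=0.01))
```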
Embedding service throughput is critical for write-path capacity. Small encoder models on a CPU produce 50 to 200 embeddings per second per core; batching on a GPU can reach 2,000 to 10,000 embeddings per second, depending on model size and batch size. If embedding generation becomes the bottleneck, queries will see stale results even if the index itself is fast. Use priority lanes so that query-triggered embeddings are never starved by batch backfills or maintenance reprocessing.
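One simple way to implement priority lanes is a two-level priority queue in front of the embedding service, as in the sketch below (lane constants and function names are made up for illustration).

```python
import itertools
import queue

# Two-lane embedding request queue: query-triggered items always dequeue
# before batch-backfill items, so reads never starve behind reprocessing.
QUERY_LANE, BATCH_LANE = 0, 1
_seq = itertools.count()  # tiebreaker keeps FIFO order within a lane

embed_requests = queue.PriorityQueue()

def submit(text: str, lane: int) -> None:
    embed_requests.put((lane, next(_seq), text))

def next_request() -> str:
    lane, _, text = embed_requests.get()
    return text

# A backfill enqueued first still yields to a later query-triggered request.
submit("old document re-embed", BATCH_LANE)
submit("user query text", QUERY_LANE)
assert next_request() == "user query text"
```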
💡 Key Takeaways
• Indexing lag: target p95 under 2 to 5 seconds; queue-depth spikes indicate write storms; per-shard throughput is often 1,000 to 5,000 upserts per second
• Memory per million vectors ranges from 1 to 4 GB at 256 to 768 dimensions; a tombstone ratio over 10 to 20 percent requires compaction to maintain recall
• Correlate query p95 and p99 latency with maintenance tasks; sudden spikes during compaction indicate resource contention or configuration issues
• Monitor business metrics like click-through rate and null rate; drops of 5 percent or more signal index corruption or training-serving skew
• Embedding throughput: 50 to 200 per second per CPU core, 2,000 to 10,000 per second per GPU with batching; a bottleneck here causes stale results
• Priority lanes ensure query-triggered embeddings are not starved by batch backfills, which is critical for keeping read-path latency low
📌 Examples
Alert setup: if indexing lag p95 exceeds 5 seconds for 3 minutes, page the on-call; if queue depth stays over 100,000 for 5 minutes, trigger auto-scaling (a sketch of this alert logic follows below)
Compaction correlation: p99 query latency spikes from 80 ms to 200 ms during nightly compaction; move compaction to off-peak hours or add CPU budget
Embedding bottleneck: the index can handle 5,000 upserts per second but the embedding service maxes out at 2,000 per second on CPU; deploy a GPU batch service to unblock the write path
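A minimal sketch of the "sustained breach" alert rule from the first example, written as plain Python rather than any specific alerting system's configuration; the class name and thresholds are assumptions for illustration.

```python
import time

class SustainedThresholdAlert:
    """Fires only when a metric stays above its threshold for a full window,
    mirroring rules like 'indexing lag p95 > 5 s for 3 minutes'."""

    def __init__(self, threshold: float, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self.breach_start = None

    def observe(self, value: float, now: float = None) -> bool:
        now = time.time() if now is None else now
        if value <= self.threshold:
            self.breach_start = None          # reset on any healthy sample
            return False
        if self.breach_start is None:
            self.breach_start = now           # start of a sustained breach
        return now - self.breach_start >= self.window_s

# Indexing lag p95 > 5 s sustained for 3 minutes -> page on-call.
lag_alert = SustainedThresholdAlert(threshold=5.0, window_s=180)
# Queue depth > 100,000 sustained for 5 minutes -> trigger auto-scaling.
queue_alert = SustainedThresholdAlert(threshold=100_000, window_s=300)
```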