ML-Powered Search & Ranking › Dense Retrieval (BERT-based Embeddings) · Hard · ⏱️ ~3 min

Vector Compression and Quantization Trade-offs for Dense Retrieval

Definition
Vector quantization compresses embeddings by representing them with fewer bits or lower-dimensional approximations. A 768-dim float32 vector (3KB) can compress to 96 bytes (32x reduction) with acceptable recall loss.

Scalar Quantization

Convert each float32 (4 bytes) to int8 (1 byte) or smaller. Map the float range to 256 discrete levels. 4x memory reduction with 1-3% recall loss. Simple to implement, works with standard ANN libraries. Best for memory-constrained deployments where slight recall loss is acceptable.
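A minimal numpy sketch of per-vector scalar quantization (the helper names are illustrative, not from any particular ANN library):

```python
import numpy as np

def scalar_quantize(vectors):
    """Map each float32 dimension to one of 256 uint8 levels (per-vector range)."""
    lo = vectors.min(axis=1, keepdims=True)
    scale = (vectors.max(axis=1, keepdims=True) - lo) / 255.0
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale  # keep lo/scale alongside the codes for dequantization

def scalar_dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

vecs = np.random.randn(1000, 768).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)
approx = scalar_dequantize(codes, lo, scale)
# codes: 768 bytes per vector vs 3,072 bytes for float32 (4x smaller)
```

The per-dimension error is bounded by half a quantization step, which is why recall degrades only slightly.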

Product Quantization (PQ)

Split the vector into subvectors (e.g., 768 dims into 96 subvectors of 8 dims each). Learn a codebook of centroids for each subvector. Represent each subvector by its nearest centroid index (1 byte). Result: 768-dim vector compressed to 96 bytes. 32x compression with 5-10% recall loss at high compression. Enables billion-scale search on single machines.
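The mechanism above can be sketched in plain numpy with a tiny k-means per subvector (function names are illustrative; a production system would use a library implementation such as Faiss's IndexPQ):

```python
import numpy as np

def train_pq(vectors, n_sub, n_centroids=256, iters=10, seed=0):
    """Learn one codebook of n_centroids per subvector via a few k-means rounds."""
    rng = np.random.default_rng(seed)
    n, d = vectors.shape
    sub_dim = d // n_sub  # e.g. 768 dims / 96 subvectors = 8 dims each
    codebooks = np.empty((n_sub, n_centroids, sub_dim), dtype=np.float32)
    for s in range(n_sub):
        sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
        cent = sub[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(iters):
            # assign every subvector to its nearest centroid, then recenter
            d2 = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(n_centroids):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(0)
        codebooks[s] = cent
    return codebooks

def encode_pq(vectors, codebooks):
    """Replace each subvector with the index of its nearest centroid (1 byte)."""
    n_sub, _, sub_dim = codebooks.shape
    codes = np.empty((len(vectors), n_sub), dtype=np.uint8)
    for s in range(n_sub):
        sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
        d2 = ((sub[:, None, :] - codebooks[s][None]) ** 2).sum(-1)
        codes[:, s] = d2.argmin(1)
    return codes

# small demo: 64-dim vectors split into 8 subvectors of 8 dims, 16 centroids each
vecs = np.random.randn(500, 64).astype(np.float32)
books = train_pq(vecs, n_sub=8, n_centroids=16, iters=5)
codes = encode_pq(vecs, books)  # shape (500, 8): 8 bytes per 256-byte vector
```

With 256 centroids per codebook, each subvector index fits in exactly one byte, which is where the 768-dim → 96-byte figure comes from.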

Dimensionality Reduction

Train with lower output dimensions (256 or 384 instead of 768). Less compression than quantization, but no post-hoc accuracy loss because the model learns the smaller space directly. Alternatively, apply PCA after training and keep the top dimensions capturing 95%+ of the variance. Combine with quantization: reduce to 384 dims, then quantize to int8, achieving 8x compression (384 bytes vs 3,072) with minimal recall loss.
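The combined approach can be sketched with numpy's SVD as the PCA step followed by int8 quantization (helper names are illustrative):

```python
import numpy as np

def pca_fit(vectors, out_dim=384):
    """Fit PCA on the corpus; rows of the returned matrix are the top axes."""
    mean = vectors.mean(axis=0)
    # SVD of the centered data: rows of vt are components, ordered by variance
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:out_dim]

def reduce_then_quantize(vectors, mean, components):
    """Project down to out_dim, then int8-quantize with one global scale."""
    reduced = (vectors - mean) @ components.T
    scale = np.abs(reduced).max() / 127.0
    codes = np.clip(np.round(reduced / scale), -127, 127).astype(np.int8)
    return codes, scale

vecs = np.random.randn(1000, 768).astype(np.float32)
mean, comps = pca_fit(vecs, out_dim=384)
codes, scale = reduce_then_quantize(vecs, mean, comps)
# 384 int8 bytes vs 3,072 float32 bytes per vector: 8x compression
```

Fitting the PCA on a representative sample of the corpus is usually sufficient; the projection matrix and scale must then be applied to queries at search time as well.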

⚠️ Trade-off: Higher compression = more recall loss. Measure on your data. Target: <5% recall@100 degradation for production systems.
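One way to run that measurement is a brute-force comparison of exact float32 top-k against top-k from the compressed vectors (a sketch; `recall_at_k` is a hypothetical helper, and int8 with a single global scale stands in for whatever compression you deploy):

```python
import numpy as np

def recall_at_k(exact_ids, approx_ids, k=100):
    """Fraction of the exact top-k neighbors the compressed search recovers."""
    hits = sum(len(set(e[:k]) & set(a[:k])) for e, a in zip(exact_ids, approx_ids))
    return hits / (k * len(exact_ids))

rng = np.random.default_rng(0)
corpus = rng.standard_normal((5000, 128)).astype(np.float32)
queries = rng.standard_normal((20, 128)).astype(np.float32)

# int8-quantize the corpus with a single global scale
scale = np.abs(corpus).max() / 127.0
corpus_i8 = np.clip(np.round(corpus / scale), -127, 127).astype(np.int8)

exact = np.argsort(-(queries @ corpus.T), axis=1)
approx = np.argsort(-(queries @ (corpus_i8.astype(np.float32) * scale).T), axis=1)
r = recall_at_k(exact, approx, k=100)  # compare against your degradation budget
```

Run this on held-out queries from your actual traffic rather than synthetic vectors; recall loss is highly data-dependent.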
💡 Key Takeaways
Scalar quantization: float32 → int8 gives 4x compression with 1-3% recall loss
Product quantization: 768-dim → 96 bytes (32x compression) with 5-10% recall loss
PQ enables billion-scale search on single machines through extreme compression
Dimensionality reduction: train at 256-384 dims or apply PCA post-training
Combine techniques: reduce dims + quantize for 8x+ compression with minimal recall loss
📌 Interview Tips
1. Explain scalar vs product quantization trade-offs (4x/1-3% vs 32x/5-10% recall loss)
2. Describe the PQ mechanism (split into subvectors, one codebook per subvector) for technical depth
3. Recommend the combined approach (dimension reduction + quantization) for best results