
Vector Compression and Quantization Trade-offs for Dense Retrieval

Memory cost dominates dense retrieval infrastructure at scale. A 768-dimensional float32 vector consumes 3 kilobytes, so raw vector storage for 100 million documents is roughly 300 gigabytes, and replicating across 3 datacenters for availability brings the total to 900 gigabytes before index overhead. This drives teams toward aggressive compression, but every compression method trades recall for memory.

Float16 quantization is the simplest approach, halving memory by reducing precision from 32 bits to 16 bits per dimension. Modern CPUs and GPUs have native float16 support, so computation stays fast. Recall degradation is typically under 0.5 percentage points because 16-bit precision is sufficient for similarity ranking. This is the first step most teams take, reducing 100 million vectors from 300 gigabytes to 150 gigabytes with negligible quality loss.

Product quantization (PQ) achieves 10 to 30 times compression but requires careful tuning. PQ splits the 768-dimensional vector into subvectors, typically 96 subvectors of 8 dimensions each, then learns a codebook of 256 centroids per subvector using k-means clustering. Each subvector is represented by a 1-byte codebook index instead of 8 float32 values, reducing 32 bytes to 1 byte per subvector, so a 768-dimensional vector compresses from 3 kilobytes to 96 bytes. The trade-off is a typical 2 to 5 point drop in recall@100 because quantization loses fine-grained distance information; careful codebook training on representative data minimizes this loss.

Scalar quantization offers a middle ground. Instead of learning codebooks, it linearly maps float values to 8-bit integers per dimension, dividing each dimension's range into 256 bins. A 768-dimensional float32 vector becomes 768 bytes, a 4 times compression, and recall@100 loss is typically 1 to 2 points, better than product quantization on quality but with less compression. Scalar quantization is simpler to implement and doesn't require training codebooks, making it popular for teams that want better than float16 without PQ complexity.

The choice depends on your serving constraints. If memory is abundant, float16 maximizes recall. If you need to serve from a single machine or minimize memory cost, product quantization at 64 to 96 bytes per vector enables billion-scale indices. Meta uses FAISS with product quantization to compress embeddings for billion-scale retrieval, achieving sub-10-gigabyte indices for tens of millions of items. Google's ScaNN uses learned quantization similar to product quantization with optimizations for x86 SIMD instructions, achieving strong recall at high compression. The operational lesson is to benchmark recall@k on your actual query distribution before committing, because compression artifacts interact with your specific embedding geometry; the sketches below make the sizing and the benchmarking step concrete.
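To make the memory arithmetic concrete, here is a minimal sizing sketch for the schemes above, assuming a 768-dimensional embedding and a 100-million-document corpus as in the text. It is pure arithmetic with no library dependencies; the corpus size and subvector count are the same illustrative numbers used in the prose.

```python
# Back-of-the-envelope index sizing for 100M 768-dimensional embeddings.
DIM = 768
NUM_DOCS = 100_000_000

bytes_per_vector = {
    "float32": DIM * 4,                     # 3072 bytes, ~3 KB per vector
    "float16": DIM * 2,                     # 1536 bytes, 2x compression
    "int8 scalar quantization": DIM * 1,    # 768 bytes, 4x compression
    "PQ (96 subvectors, 8 bits)": 96 * 1,   # 96 bytes, 32x compression
}

for scheme, nbytes in bytes_per_vector.items():
    total_gb = nbytes * NUM_DOCS / 1e9
    print(f"{scheme:<28} {nbytes:>5} B/vector  ~{total_gb:,.0f} GB total")
```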
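For the two quantization schemes themselves, a minimal sketch using FAISS follows. It assumes the faiss-cpu package and uses random stand-in embeddings; a real system would train on representative document vectors, since codebook quality depends on the training distribution.

```python
import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

d = 768                                               # embedding dimension
xb = np.random.randn(100_000, d).astype("float32")    # stand-in corpus vectors
xq = np.random.randn(10, d).astype("float32")         # stand-in query vectors

# Product quantization: 96 subvectors x 8 bits -> 96 bytes per stored vector.
pq = faiss.IndexPQ(d, 96, 8)
pq.train(xb)          # k-means learns a 256-centroid codebook per subvector
pq.add(xb)
D_pq, I_pq = pq.search(xq, 100)

# Scalar quantization: one 8-bit code per dimension -> 768 bytes per vector.
sq = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xb)          # learns per-dimension ranges that define the 256 bins
sq.add(xb)
D_sq, I_sq = sq.search(xq, 100)
```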
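And for the closing recommendation, a sketch of how recall@100 against an exact float32 baseline might be measured, continuing from the variables above. The flat index is the ground truth; in practice you would run this with your real embeddings and a sample of production queries rather than synthetic data.

```python
# Exact float32 search as ground truth for recall@100.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, I_true = flat.search(xq, 100)

def recall_at_k(I_approx, I_exact, k=100):
    # Fraction of the true top-k neighbours recovered by the compressed index.
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(I_approx, I_exact))
    return hits / (len(I_exact) * k)

print("PQ     recall@100:", recall_at_k(I_pq, I_true))
print("Scalar recall@100:", recall_at_k(I_sq, I_true))
```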
💡 Key Takeaways
Float16 reduces memory by 2 times with under 0.5 point recall loss, the default first step for production systems with native hardware support
Product quantization achieves 10 to 30 times compression by learning codebooks of centroids, compressing a 768-dimensional float32 vector from 3 kilobytes to 64 to 96 bytes with a typical 2 to 5 point recall drop
Scalar quantization provides 4 times compression and 1 to 2 point recall loss as a middle ground, simpler than PQ because it avoids codebook training
Memory dominates cost at scale: 100 million float32 vectors need 300 gigabytes, product quantization reduces this to under 10 gigabytes enabling single machine deployment
Compression artifacts interact with specific embedding geometry, requiring benchmarking on actual query distribution before production deployment
📌 Examples
Meta FAISS uses product quantization with 96 byte vectors to serve billion scale retrieval, fitting tens of millions of items in under 10 gigabytes per shard
Google ScaNN implements learned quantization similar to product quantization with SIMD optimizations, achieving 20 to 30 times compression with strong recall on x86 CPUs
Amazon product search uses scalar quantization to int8 for 4 times compression, balancing simplicity and recall while avoiding the complexity of codebook training and updates