Vector Compression and Quantization Trade-offs for Dense Retrieval
Scalar Quantization
Convert each float32 value (4 bytes) to int8 (1 byte) or smaller. Map the observed float range to 256 discrete levels. Yields a 4x memory reduction, typically at a 1-3% recall loss. Simple to implement and supported by standard ANN libraries. Best for memory-constrained deployments where a slight recall loss is acceptable.
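A minimal sketch of the idea in numpy, assuming a single global min/max range shared across all dimensions (production systems often use per-dimension ranges or percentile clipping; the function names here are illustrative, not from any particular library):

```python
import numpy as np

def scalar_quantize(vectors):
    # Map floats onto 256 levels spanning the observed [min, max] range.
    lo, hi = float(vectors.min()), float(vectors.max())
    scale = (hi - lo) / 255.0
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def scalar_dequantize(codes, lo, scale):
    # Reverse the mapping; error is bounded by about half a quantization step.
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 768)).astype(np.float32)
codes, lo, scale = scalar_quantize(x)
x_hat = scalar_dequantize(codes, lo, scale)
print(codes.nbytes / x.nbytes)  # → 0.25, the 4x memory reduction
```

Distances computed on the dequantized (or directly on the int8) vectors are approximate, which is where the small recall loss comes from.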
Product Quantization (PQ)
Split each vector into subvectors (e.g., a 768-dim vector into 96 subvectors of 8 dims each). Learn a codebook of 256 centroids for each subvector position, then represent each subvector by the index of its nearest centroid (1 byte). Result: a 768-dim float32 vector (3,072 bytes) compressed to 96 bytes, a 32x reduction, typically at a 5-10% recall loss at this compression level. Enables billion-scale search on a single machine.
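The encode/decode cycle can be sketched end to end with plain k-means in numpy. This is a toy version under assumed parameters: k=16 centroids and small vectors keep the demo fast, whereas real PQ uses k=256 so each code fits in one byte, and libraries such as faiss implement this far more efficiently:

```python
import numpy as np

def train_pq(X, m, k, iters=10, seed=0):
    # Learn one codebook of k centroids per subvector via Lloyd's k-means.
    n, d = X.shape
    ds = d // m  # subvector dimension; assumes d is divisible by m
    rng = np.random.default_rng(seed)
    codebooks = np.empty((m, k, ds), dtype=np.float32)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)].copy()
        for _ in range(iters):
            d2 = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        codebooks[j] = cent
    return codebooks

def pq_encode(X, codebooks):
    # Replace each subvector by the index of its nearest centroid.
    m, k, ds = codebooks.shape
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        d2 = ((sub[:, None, :] - codebooks[j][None]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(1)
    return codes

def pq_decode(codes, codebooks):
    # Reconstruct an approximate vector by concatenating centroids.
    m, k, ds = codebooks.shape
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 64)).astype(np.float32)
cb = train_pq(X, m=8, k=16)
codes = pq_encode(X, cb)       # 64 floats (256 bytes) -> 8 one-byte codes
X_hat = pq_decode(codes, cb)   # lossy reconstruction from the codebooks
```

At search time, implementations avoid decoding entirely: distances are computed via lookup tables of query-to-centroid distances, one table per subvector.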
Dimensionality Reduction
Train the model with a lower output dimension (256 or 384 instead of 768). Less compression than quantization, but no post-hoc accuracy loss. Alternatively, apply PCA after training and keep the top dimensions capturing 95%+ of the variance. Combine with quantization: reduce to 384 dims, then quantize to int8, achieving 8x compression (3,072 bytes down to 384 per vector) with minimal recall loss.
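The combined pipeline can be sketched as PCA-via-SVD followed by the scalar quantization above. This assumes a post-hoc reduction on a trained model's embeddings; `out_dim=384` and the global-range quantizer are illustrative choices:

```python
import numpy as np

def pca_reduce(X, out_dim):
    # Center the data and project onto the top principal directions
    # (rows of Vt from the SVD, sorted by explained variance).
    mean = X.mean(0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:out_dim].T
    return (X - mean) @ W, mean, W

def int8_quantize(X):
    # Same global-range scalar quantizer as in the sketch above.
    lo, hi = float(X.min()), float(X.max())
    scale = (hi - lo) / 255.0
    return np.round((X - lo) / scale).astype(np.uint8), lo, scale

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768)).astype(np.float32)
Z, mean, W = pca_reduce(X, out_dim=384)   # 768 -> 384 dims (2x)
codes, lo, scale = int8_quantize(Z)       # float32 -> int8 (4x)
print(X[0].nbytes / codes[0].nbytes)      # → 8.0, the combined 8x
```

Queries must pass through the same stored `mean` and projection `W` before quantized comparison, so both need to be persisted alongside the index.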