
Dimensionality and Quantization Trade-offs

Higher dimensionality improves expressivity and recall by giving embeddings more degrees of freedom to separate semantically distinct items. Moving from 384 to 768 dimensions often yields 2 to 5 point gains in Recall@10 and 1 to 3 point nDCG improvements. However, doubling dimensionality doubles the memory footprint and increases compute.

At 100 million documents with 768 dimensions in float16 (2 bytes per value), raw vectors consume 768 × 2 × 100M = 153.6 GB. Doubling to 1536 dimensions adds another 153.6 GB. At object storage rates of $0.02 per GB per month that is only about $3 more per month on disk; the real expense is serving, because the extra vectors typically have to sit in RAM on (often replicated) ANN nodes, which can add thousands of dollars per month before indexing overhead.

Search latency also grows with dimension. Approximate Nearest Neighbor (ANN) index construction time scales roughly linearly or worse with dimension, and query-time inner product or cosine similarity computation scales linearly with dimension, adding microseconds per comparison that accumulate across millions of comparisons. Meta reports that moving from 512 to 1024 dimensions increased index build time by 80% and query latency by 30 to 40% at billion scale.

Quantization compresses embeddings by reducing precision: float32 to float16 cuts size in half, float16 to int8 cuts it in half again (4x total versus float32), and product quantization or binary codes can achieve 8 to 32x compression. Pinterest uses 8-bit quantization on billions of pin embeddings, reducing memory from 1.2 TB to 300 GB while keeping Recall@100 within 1 point of uncompressed. However, quantization reduces fine-grained similarity resolution: hard negative pairs (items that share keywords but differ semantically) become harder to separate, potentially dropping nDCG by 1 to 5 points depending on domain and compression level.

The decision hinges on your constraint. If you are memory bound (a mobile app, an edge device, or a cost-sensitive cloud deployment), quantization is essential. If you are latency bound and have memory headroom, a higher dimension without quantization may be optimal. Spotify uses 256 dimensions with float16 for track embeddings because the 50 GB memory footprint fits comfortably in serving nodes and the 10 to 15 ms query encoding plus 20 ms ANN search stays within budget. Google experiments show that 768 dimensions with 8-bit quantization often beats 1536 dimensions uncompressed, because the memory savings allow larger in-memory indexes and faster retrieval.
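To make the footprint arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The helper names, corpus size, and the $0.02 per GB per month object-storage rate are illustrative assumptions taken from the figures above; the sketch ignores ANN index overhead and replication.

```python
# Back-of-the-envelope memory and storage-cost estimate for a vector corpus.
# All inputs are illustrative assumptions, not measured values.

BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}

def raw_vector_footprint_gb(num_docs: int, dim: int, dtype: str) -> float:
    """Raw embedding size in GB, ignoring ANN index overhead and replicas."""
    return num_docs * dim * BYTES_PER_VALUE[dtype] / 1e9

def monthly_storage_cost(gb: float, usd_per_gb_month: float = 0.02) -> float:
    """Object-storage cost at an assumed $/GB/month rate."""
    return gb * usd_per_gb_month

if __name__ == "__main__":
    for dim, dtype in [(384, "float16"), (768, "float16"),
                       (768, "int8"), (1536, "float16")]:
        gb = raw_vector_footprint_gb(100_000_000, dim, dtype)
        print(f"{dim}D {dtype}: {gb:6.1f} GB raw, "
              f"~${monthly_storage_cost(gb):.2f}/month object storage")
```

Running it reproduces the 76.8 GB (384D float16), 153.6 GB (768D float16), and 307.2 GB (1536D float16) figures used above; the 768D int8 row lands back at 76.8 GB, which is why quantization is often a cheaper lever than cutting dimensions.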
💡 Key Takeaways
Doubling from 384 to 768 dimensions yields 2 to 5 point Recall@10 gain but doubles memory (76.8 GB to 153.6 GB for 100M docs) and increases latency 30 to 50%
At 100 million documents, 768D float16 uses 153.6 GB of raw vectors versus 76.8 GB at 384D; at object storage rates of $0.02 per GB per month the difference is only a few dollars, so the dominant added cost is holding the extra vectors in serving memory
Quantization from float16 to int8 cuts memory by 50%, float32 to int8 by 75%, with typical nDCG drop of 1 to 5 points depending on compression and domain
Pinterest uses 8 bit quantization on billions of pins, reducing memory from 1.2 TB to 300 GB while keeping Recall@100 within 1 point of uncompressed
Google experiments show 768D int8 often outperforms 1536D float16 because memory savings enable larger in memory indexes and faster retrieval
Meta reports moving from 512D to 1024D increased index build time by 80% and query latency by 30 to 40% at billion scale, limiting practical dimension choices
📌 Examples
Spotify uses 256D float16 for track embeddings (50 GB memory for 100M tracks) because it fits serving nodes and keeps query encoding at 10 to 15 ms plus 20 ms ANN search
A mobile app requiring on device embedding limits to 128D int8 (64 MB for 500k items) to fit memory constraints, accepting 3 point nDCG drop vs server 768D float16
Pinterest reduces 768D float32 embeddings (4 bytes per value becomes 1 byte with int8 quantization, as sketched below) from 1.2 TB to 300 GB, saving $18,000 annually in storage and serving costs
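As a rough illustration of the float32 to int8 step in the Pinterest example, the sketch below applies simple per-vector symmetric quantization to toy vectors. The scaling scheme, data, and function names are assumptions for illustration, not Pinterest's actual pipeline.

```python
# Minimal sketch of symmetric int8 quantization for embeddings and its effect
# on cosine similarity. Toy data and the per-vector scaling are illustrative.
import numpy as np

def quantize_int8(vecs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-vector symmetric quantization: float32 -> int8 plus a float scale."""
    scales = np.abs(vecs).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(vecs / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float32 vectors from int8 codes and scales."""
    return q.astype(np.float32) * scales

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query, doc = rng.normal(size=(2, 768)).astype(np.float32)

    q, s = quantize_int8(np.stack([query, doc]))
    query_q, doc_q = dequantize(q, s)

    print(f"float32 cosine: {cosine(query, doc):+.4f}")
    print(f"int8    cosine: {cosine(query_q, doc_q):+.4f}")   # small drift
    print(f"compression: {query.nbytes / q[0].nbytes:.0f}x")  # 4x (4 bytes -> 1)
```

The 4x compression falls directly out of the byte widths (4 bytes per value down to 1), and the small drift between the float32 and dequantized cosine scores is the fine-grained resolution loss that can make hard negatives harder to separate.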