Recommendation Systems • Collaborative Filtering (Matrix Factorization)
Memory, Scale, and Operational Trade-offs at 100M+ Items
Operating Matrix Factorization at the scale of modern platforms (100 million items, hundreds of millions of users) requires careful capacity planning and operational discipline. Memory footprint, training cost, serving throughput, and freshness pipelines each impose hard constraints that force trade-offs.
Memory is the first bottleneck. Embeddings alone require (|U| + |I|) × k × 4 bytes. For 50 million users and 100 million items at k=64 dimensions: (150M × 64 × 4 bytes) equals approximately 38.4 GB just for the raw embeddings, plus biases, ANN index overhead (1.5x to 3x), and cache metadata. A full serving stack might need 100 to 150 GB per replica. To scale, you shard by item ID (each shard handles a range of items), replicate for throughput (10 to 20 replicas per region to handle tens of thousands of QPS), and quantize embeddings (float32 to int8 reduces size by 4x with minimal accuracy loss, bringing 38.4 GB down to 9.6 GB).
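To make the sizing arithmetic concrete, here is a minimal back-of-the-envelope sketch in plain Python; the overhead factors for the ANN index and cache are illustrative assumptions, not measurements:

```python
def embedding_memory_gb(num_users, num_items, k, bytes_per_dim=4):
    """Raw embedding table size in GB: (|U| + |I|) * k * bytes_per_dim."""
    return (num_users + num_items) * k * bytes_per_dim / 1e9


def serving_footprint_gb(raw_gb, ann_overhead=2.0, cache_overhead=0.5):
    """Very rough per-replica estimate: embeddings + ANN index + cache metadata.
    The overhead factors are assumptions; measure your own index and cache."""
    return raw_gb * (1 + ann_overhead + cache_overhead)


raw_f32 = embedding_memory_gb(50_000_000, 100_000_000, k=64)                    # 38.4 GB
raw_int8 = embedding_memory_gb(50_000_000, 100_000_000, k=64, bytes_per_dim=1)  # 9.6 GB
print(f"float32: {raw_f32:.1f} GB, int8 quantized: {raw_int8:.1f} GB")
print(f"per replica with overhead (float32): ~{serving_footprint_gb(raw_f32):.0f} GB")
```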
Training cost scales with the number of interactions and the embedding dimension. Alternating Least Squares (ALS) on 1 billion interactions with k=64 might take 2 to 4 hours on a cluster of 50 to 100 machines (10 to 20 iterations). Increasing k from 64 to 128 doubles memory and at least doubles training time, but often yields only a 1% to 3% accuracy improvement. Implicit feedback with confidence weighting is more expensive than explicit ratings because the loss is computed over the entire matrix (all user-item pairs), not just observed entries. Production systems carefully tune k: start at 32 or 64, measure Recall@K on a holdout set, and increase only if the gain justifies the cost.
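A minimal sketch of the Recall@K measurement used in that tuning loop, assuming trained user and item factor matrices are already available (pure NumPy, brute-force scoring that only makes sense for a sampled holdout, not 100M items; all names are illustrative):

```python
import numpy as np

def recall_at_k(user_factors, item_factors, holdout, top_k=20):
    """holdout: dict mapping user index -> set of held-out item indices.
    Brute-force scores every item per user; fine for a sampled holdout,
    a real evaluation over 100M items would query the ANN index instead."""
    hits, total = 0, 0
    for user, true_items in holdout.items():
        scores = item_factors @ user_factors[user]        # (num_items,)
        top = np.argpartition(-scores, top_k)[:top_k]     # unsorted top-k in O(num_items)
        hits += len(true_items & set(top.tolist()))
        total += len(true_items)
    return hits / max(total, 1)

# Tuning loop (training not shown): fit models at k = 32, 64, 128 and compare
# recall_at_k(...) on the same holdout against the extra training time and memory.
```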
Freshness pipelines are the operational nightmare. Item embeddings are retrained daily or hourly in batch jobs, which requires orchestrating data pipelines (collect interactions from logs), distributed training (ALS or SGD across a cluster), ANN index rebuilds (hours for 100M vectors), and coordinated deployment (version the embeddings, switch traffic atomically). User embeddings can be updated in real time from interaction streams (Kafka, Kinesis) but must stay aligned with the current item embedding version. Misalignment causes serving inconsistencies: a user vector computed with yesterday's item factors, scored against today's ANN index, produces garbage. The solution is strict versioning, dual reads during transitions (serve from both old and new indexes, compare scores, switch traffic gradually), and health checks (monitor embedding norms, coverage, and P99 latency).
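A minimal sketch of the versioning and dual-read pattern, with a brute-force stand-in for the ANN index and hypothetical class names; a production version would live inside the serving framework and feed its overlap metric into health checks:

```python
import numpy as np

class ItemIndex:
    """Stand-in for an ANN index: brute-force top-k over one version's item factors."""
    def __init__(self, version, item_factors):
        self.version = version                # e.g. the training run ID
        self.item_factors = item_factors      # shape (num_items, k)

    def search(self, user_vector, top_k):
        scores = self.item_factors @ user_vector
        return np.argpartition(-scores, top_k)[:top_k]


class DualReadRetriever:
    """During a rollout, query both versions, compare results, and ramp traffic."""
    def __init__(self, old: ItemIndex, new: ItemIndex, new_traffic_pct=0.0):
        self.old, self.new = old, new
        self.new_traffic_pct = new_traffic_pct   # ramp 0 -> 100 as health checks pass

    def retrieve(self, user_vectors_by_version, user_id, top_k=50):
        # Invariant: each index is only ever paired with user vectors trained
        # against the SAME item factors, never across versions.
        old_hits = self.old.search(user_vectors_by_version[self.old.version][user_id], top_k)
        new_hits = self.new.search(user_vectors_by_version[self.new.version][user_id], top_k)
        overlap = len(set(old_hits.tolist()) & set(new_hits.tolist())) / top_k
        # In production, `overlap` would feed a dashboard/health check before ramping traffic.
        serve_new = np.random.rand() * 100 < self.new_traffic_pct
        return new_hits if serve_new else old_hits
```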
The ultimate trade-off is accuracy versus cost. A single Matrix Factorization model with k=64 running on CPU might cost 5,000 dollars per month in compute and serve 10,000 QPS. Increasing k to 256 or adding real-time user updates can improve CTR by 3% to 5% but triples the compute cost to 15,000 dollars per month. An ensemble of models or a deep neural network ranker on GPU might cost 50,000 dollars per month for an additional 1% to 2% CTR gain. Production teams obsessively measure cost per engagement (dollars per click or play) to justify these investments.
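A back-of-the-envelope helper for that cost-per-engagement comparison; it assumes flat, sustained traffic over a 30-day month, which a real analysis should replace with measured request counts:

```python
def cost_per_million_clicks(monthly_cost_usd, qps, ctr, days=30):
    """Assumes flat, sustained QPS all month; swap in measured request counts in practice."""
    clicks_per_month = qps * ctr * days * 86_400
    return monthly_cost_usd / clicks_per_month * 1e6

print(cost_per_million_clicks(5_000, 10_000, 0.02))     # k=64 baseline: ~9.6 dollars per 1M clicks
print(cost_per_million_clicks(8_000, 10_000, 0.0205))   # k=128: ~15.1 dollars per 1M clicks
```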
💡 Key Takeaways
• Memory: 150M entities at k=64 dimensions is 38.4 GB of raw embeddings. With the ANN index (2x overhead) and cache, a full serving stack needs 100 to 150 GB per replica. Shard and replicate to scale
• Quantization from float32 to int8 reduces memory by 4x (38.4 GB to 9.6 GB) with only 1% to 2% accuracy loss. Critical for fitting large catalogs in memory at scale
• Training cost: ALS on 1B interactions with k=64 takes 2 to 4 hours on 50 to 100 machines. Doubling k to 128 doubles memory and at least doubles training time for only a 1% to 3% accuracy gain
• Freshness requires orchestrated pipelines: daily batch training (2 to 4 hours), ANN rebuild (1 to 3 hours for 100M vectors), versioned deployment, and real-time user updates from streams
• Serving inconsistencies from version skew: a user vector from old item factors scored against a new ANN index produces wrong results. Use strict versioning, dual-read transitions, and health checks
• Accuracy vs. cost trade-off: a k=64 CPU model costs 5,000 dollars per month at 10K QPS. k=256 or real-time updates improve CTR by 3% to 5% but cost 15,000 dollars per month (3x). Measure cost per engagement
📌 Examples
Spotify scale: 100M tracks, 200M users, k=64. Raw embeddings: (300M × 64 × 4 bytes) = 76.8 GB. With the ANN index and replication (20 replicas for 50K QPS globally), total memory: 2 to 3 TB across the cluster
Cost example: Matrix Factorization at k=64 on CPU serves 10K QPS for 5,000 dollars per month. At 2% CTR, that is 200 clicks per second, or roughly 518M clicks over a 30-day month, so cost per 1M clicks is approximately 10 dollars. Upgrading to k=128 improves CTR to 2.05% but costs 8,000 dollars per month, raising cost per 1M clicks to approximately 15 dollars. Whether the 2.5% relative CTR gain justifies the roughly 55% higher cost per click depends on revenue per click