Recommendation Systems • Collaborative Filtering (Matrix Factorization) • Medium • ⏱️ ~3 min
Training Algorithms: Alternating Least Squares vs Stochastic Gradient Descent
Two training algorithms dominate Matrix Factorization in production: Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD). Each makes fundamentally different tradeoffs between parallelism, convergence guarantees, and flexibility.
ALS alternates between two convex optimization problems. First, fix all item vectors and solve for the best user vectors (a least squares problem for each user independently). Second, fix all user vectors and solve for the best item vectors. Each subproblem has a closed-form solution that can be computed in parallel across all users or all items. This embarrassingly parallel structure scales beautifully: you can distribute 100 million users across a cluster and solve their least squares problems simultaneously. ALS is particularly robust for implicit feedback at scale and typically converges in 10 to 20 iterations. The downside is that it is inherently batch-oriented (you retrain all factors periodically) and limited to squared-loss objectives.
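The sketch below shows one ALS iteration in NumPy on a small explicit-rating matrix; the function name `als_step`, the regularization value, and the toy data are illustrative choices, and implicit-feedback ALS would add per-interaction confidence weights on top of the same alternating structure.

```python
import numpy as np

def als_step(R, U, V, reg=0.1):
    """One ALS iteration on a dense rating matrix R (users x items).

    Fix V and solve a regularized least squares problem per user, then fix U
    and solve per item. Each solve is independent of the others, which is why
    a real system can distribute these loops across workers.
    """
    k = U.shape[1]
    # Update user vectors with item vectors fixed.
    for u in range(R.shape[0]):
        rated = R[u] > 0                      # items this user interacted with
        V_r = V[rated]                        # (n_rated, k)
        A = V_r.T @ V_r + reg * np.eye(k)     # normal equations
        b = V_r.T @ R[u, rated]
        U[u] = np.linalg.solve(A, b)
    # Update item vectors with user vectors fixed.
    for i in range(R.shape[1]):
        rated = R[:, i] > 0
        U_r = U[rated]
        A = U_r.T @ U_r + reg * np.eye(k)
        b = U_r.T @ R[rated, i]
        V[i] = np.linalg.solve(A, b)
    return U, V

# Toy usage: 5 users, 4 items, rank-2 factors, 15 alternating iterations.
rng = np.random.default_rng(0)
R = rng.integers(0, 5, size=(5, 4)).astype(float)   # 0 = unobserved
U, V = rng.normal(size=(5, 2)), rng.normal(size=(4, 2))
for _ in range(15):
    U, V = als_step(R, U, V)
```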
SGD samples individual user-item pairs (or triplets for ranking losses), computes gradients, and updates the corresponding user and item vectors incrementally. This supports richer objectives such as Bayesian Personalized Ranking (BPR), Weighted Approximate-Rank Pairwise (WARP), or custom losses. SGD also enables online learning: when a user plays a new song, you can freeze the item vectors and run a few SGD steps to update just that user's vector within milliseconds, providing fresh recommendations immediately. The challenge is that SGD is harder to parallelize efficiently (updates to shared item vectors can conflict) and is more sensitive to learning rates, regularization, and initialization.
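Here is a minimal NumPy sketch of a single BPR-style SGD step on a sampled (user, positive item, negative item) triplet; `bpr_sgd_step`, the learning rate, and the toy sampling loop are illustrative, not a production training loop.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_sgd_step(U, V, u, i, j, lr=0.01, reg=0.01):
    """One SGD step on a sampled BPR triplet: user u prefers item i over item j.

    Only the three vectors the triplet touches are updated, which is what
    makes SGD incremental -- and what makes parallel workers contend for
    popular item vectors.
    """
    u_vec, vi, vj = U[u].copy(), V[i].copy(), V[j].copy()
    x_uij = u_vec @ (vi - vj)          # preference margin
    g = sigmoid(-x_uij)                # gradient scale: 1 - sigmoid(x_uij)
    U[u] += lr * (g * (vi - vj) - reg * u_vec)
    V[i] += lr * (g * u_vec - reg * vi)
    V[j] += lr * (-g * u_vec - reg * vj)

# Toy usage: random triplets over 1,000 users and 5,000 items with 64-dim factors.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(1000, 64))   # user factors
V = rng.normal(scale=0.1, size=(5000, 64))   # item factors
for _ in range(100_000):
    u = rng.integers(1000)
    i, j = rng.integers(5000, size=2)        # in practice: i observed, j unobserved
    bpr_sgd_step(U, V, u, i, j)
```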
Production systems often combine both. Use ALS for daily or hourly batch retraining of the full model (stable, scalable, handles billions of interactions). Use SGD for online user-vector updates between batch jobs (freshness, captures just-in-time signals like the current session). For example, retrain all 100 million item embeddings overnight with ALS, then update user embeddings in real time via a few SGD steps when they interact, keeping item vectors frozen until the next batch.
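As a rough sketch of that online refresh, the function below runs a few SGD steps on one user vector while the item vector stays frozen; mapping an observed play to a squared-loss target of 1.0, and the `refresh_user_vector` name, are assumptions made for illustration (a BPR step against a sampled negative would also work).

```python
import numpy as np

def refresh_user_vector(user_vec, item_vec, lr=0.01, steps=5, reg=0.1, target=1.0):
    """Run a few SGD steps on one user vector; the item vector is frozen.

    Assumes the implicit 'user played this item' signal is mapped to a
    target preference of 1.0 under squared loss -- an illustrative choice,
    not the only possible online objective.
    """
    v = item_vec                       # frozen: never updated here
    u = user_vec.copy()
    for _ in range(steps):
        err = target - u @ v           # prediction error on this interaction
        u += lr * (err * v - reg * u)  # gradient step on the user vector only
    return u

# Toy usage with 64-dim factors; in production both vectors would be fetched
# from a cache or feature store keyed by user id and item id.
rng = np.random.default_rng(0)
user_vec = rng.normal(scale=0.1, size=64)
item_vec = rng.normal(scale=0.1, size=64)
user_vec = refresh_user_vector(user_vec, item_vec)
```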
💡 Key Takeaways
• ALS solves convex least squares subproblems alternately, enabling embarrassingly parallel training across millions of users or items simultaneously. Typical convergence in 10 to 20 iterations on billions of interactions
• SGD supports richer objectives like pairwise ranking (BPR, WARP) and enables online updates. Freeze item vectors and run 3 to 5 SGD steps on a user vector in under 10ms when they interact
• ALS is batch-oriented (daily or hourly full retrain) but rock-solid at scale. SGD provides freshness but requires careful learning rate schedules, regularization, and drift monitoring
• Production hybrid: ALS for stable overnight retraining of all 100M item embeddings, SGD for real-time user embedding updates between batches, keeping items frozen
• SGD is harder to distribute due to conflicting updates on shared item vectors. Techniques include Hogwild! (tolerating race conditions), parameter servers, or partitioning by user to isolate updates
📌 Examples
Spotify-scale ALS: Train on 200M daily user interactions with 100M tracks and 200M users. Distribute across 100 machines, each solving least squares for 2M users in parallel. Converges in 15 iterations over 2 hours (a distributed training sketch follows these examples)
Online SGD update: User plays a new song. Fetch user vector from cache (64 dims), fetch item vector (64 dims), run 5 SGD steps with frozen item vector and learning rate 0.01, update user cache. Total latency: 8ms
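For the batch side of that first example, a job along these lines could use Spark MLlib's ALS, one common distributed implementation; the table name, column names, output paths, and the alpha confidence weight are assumptions, while the rank and iteration count mirror the numbers above.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("nightly-als-retrain").getOrCreate()

# Hypothetical interactions table: (user_id, track_id, play_count) rows
# aggregated from the day's listening logs.
interactions = spark.table("listening_events_daily")

als = ALS(
    rank=64,                 # latent dimension, matching the 64-dim vectors above
    maxIter=15,              # the batch job above converges in ~15 iterations
    regParam=0.1,
    implicitPrefs=True,      # implicit feedback: play counts, not explicit ratings
    alpha=40.0,              # confidence weighting for implicit feedback (assumed value)
    userCol="user_id",
    itemCol="track_id",
    ratingCol="play_count",
    numUserBlocks=100,       # partition users and items across the cluster
    numItemBlocks=100,
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# Persist factors for the online layer: user factors get overwritten by
# between-batch SGD refreshes, item factors stay frozen until the next run.
model.userFactors.write.mode("overwrite").parquet("/factors/users")
model.itemFactors.write.mode("overwrite").parquet("/factors/items")
```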