
Matrix Factorization: Scaling Collaborative Filtering

Core Concept
Matrix factorization decomposes the sparse user-item rating matrix into two smaller dense matrices: user factors and item factors. The dot product of a user vector and an item vector predicts the rating. This compresses billions of potential ratings into a comparatively small set of dense parameters: one K-dimensional vector per user and per item.

How It Works

Represent each user as a vector of K dimensions (typically 50-200). Represent each item as a vector of the same K dimensions. The predicted rating is the dot product: rating(u,i) = user_vector[u] · item_vector[i]. With 100K users and 10K items, instead of storing 1 billion potential ratings, you store 100K×K + 10K×K parameters.
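A small sketch of the prediction step and the parameter-count arithmetic, assuming K = 128 and randomly initialized factors purely for illustration:

```python
import numpy as np

# Sketch of the prediction step; factor values are random here purely to
# illustrate shapes and the parameter-count arithmetic.
n_users, n_items, k = 100_000, 10_000, 128

rng = np.random.default_rng(0)
user_factors = rng.normal(scale=0.1, size=(n_users, k))  # one K-dim vector per user
item_factors = rng.normal(scale=0.1, size=(n_items, k))  # one K-dim vector per item

def predict(u: int, i: int) -> float:
    """Predicted rating = dot product of the user and item vectors."""
    return float(user_factors[u] @ item_factors[i])

# 100K users x 10K items = 1 billion potential ratings,
# but only (100K + 10K) x 128 ≈ 14M stored parameters.
print(predict(42, 7))
print(user_factors.size + item_factors.size)  # 14,080,000
```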

Training adjusts user and item vectors so that predictions match observed ratings. Use gradient descent: for each known rating, compute the prediction error, then update both vectors to reduce error. After millions of updates across all known ratings, the vectors encode latent preferences that generalize to unrated items.
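A minimal stochastic gradient descent sketch on a toy set of ratings; the learning rate, regularization strength, and data are illustrative assumptions, not values from the text:

```python
import numpy as np

# A toy SGD loop for matrix factorization. The ratings, learning rate (lr),
# regularization (reg), and epoch count are illustrative assumptions.
ratings = [(0, 1, 4.0), (0, 2, 2.0), (1, 1, 5.0), (2, 0, 3.0)]  # (user, item, rating)
n_users, n_items, k = 3, 3, 8

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        pu, qi = P[u].copy(), Q[i].copy()
        err = r - pu @ qi                   # error on one observed rating
        P[u] += lr * (err * qi - reg * pu)  # nudge both vectors to shrink the error
        Q[i] += lr * (err * pu - reg * qi)

print(round(P[0] @ Q[1], 2))  # should land near the observed 4.0
```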

The Latent Factors

Each dimension captures some latent concept. For movies, dimensions might correlate with "action vs drama", "serious vs funny", or "mainstream vs indie". The model discovers these automatically. A user with high values in certain dimensions prefers items with high values in those same dimensions.

You do not label or interpret the dimensions. They emerge from optimization. After training, you can sometimes interpret them by looking at which items score highest on each dimension, but this is not guaranteed.
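One way to attempt that post-hoc inspection, assuming a trained item-factor matrix and a parallel list of titles (both hypothetical names):

```python
import numpy as np

# A rough post-hoc inspection: for each latent dimension, list the items with
# the largest factor values. `item_factors` and `titles` are hypothetical
# outputs of a trained model, not objects defined in the text.
def top_items_per_dimension(item_factors: np.ndarray, titles: list[str], top_n: int = 5) -> None:
    for d in range(item_factors.shape[1]):
        top = np.argsort(item_factors[:, d])[::-1][:top_n]
        print(f"dim {d}: {[titles[i] for i in top]}")
```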

💡 Interview Tip: When asked about collaborative filtering at scale, immediately mention matrix factorization. Nearest-neighbor methods struggle beyond 10K users. Matrix factorization scales to millions because prediction is a single dot product, and training is embarrassingly parallel across observed ratings.
💡 Key Takeaways
Problem: comparing 10M users directly requires 100 trillion similarity computations. Does not scale.
Solution: represent each user and item as a small vector (128 numbers). Similarity = dot product of vectors. Fast and scalable.
The 128 dimensions are hidden preferences learned from data: maybe dim 1 = likes action, dim 2 = prefers long films. The model learns these without labels.
Prediction: the dot product of a user vector and an item vector. Alice [0.5, 0.3, ...] · The Matrix [0.8, 0.2, ...] = predicted rating.
Training: start with random vectors, predict known ratings, measure error, adjust vectors to reduce error. Repeat millions of times until the vectors reconstruct known ratings well.
Compression: a 10T-cell matrix becomes 1.4B numbers (10M×128 user vectors + 1M×128 item vectors). Any of the 10T cells can be predicted with one dot product.
📌 Interview Tips
1. When explaining ALS: describe the alternating optimization - fix users, solve for items (closed-form least squares), then fix items, solve for users. Repeat 10-20 iterations (see the sketch after this list).
2. For distributed training: explain that ALS parallelizes naturally - each user/item update is independent given the other matrix, enabling 100+ machine clusters.
3. When asked about convergence: mention that ALS typically converges in 10-20 iterations, with each iteration taking O(nnz × k²) time, where nnz is the number of non-zero interactions.
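A compact illustration of those alternating solves on a dense rating matrix with an observation mask. The function name, hyperparameters, and the dense representation are assumptions made for readability; production ALS works on sparse data and distributes the per-user and per-item solves across machines.

```python
import numpy as np

# Minimal ALS sketch. R holds ratings, mask marks observed entries.
# k, reg, and iters are illustrative defaults, not recommended settings.
def als(R: np.ndarray, mask: np.ndarray, k: int = 16, reg: float = 0.1, iters: int = 15):
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    I = reg * np.eye(k)
    for _ in range(iters):
        # Fix item factors; solve a small regularized least-squares problem per user.
        for u in range(n_users):
            obs = mask[u] > 0
            Qo = Q[obs]
            P[u] = np.linalg.solve(Qo.T @ Qo + I, Qo.T @ R[u, obs])
        # Fix user factors; solve per item.
        for i in range(n_items):
            obs = mask[:, i] > 0
            Po = P[obs]
            Q[i] = np.linalg.solve(Po.T @ Po + I, Po.T @ R[obs, i])
    return P, Q
```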