What is Content-Based Filtering in Recommendation Systems?
Content-Based Filtering (CBF) recommends items by matching item attributes to a user's historical preferences. Instead of learning from what similar users liked, it analyzes the features of the items themselves: text descriptions, images, audio properties, or structured metadata such as genre and cast.
The system builds a user profile by aggregating the feature vectors of items the user engaged with, typically weighted by interaction strength (a purchase counts more than a view) and by recency (recent interactions matter more, often through exponential decay with a 7 to 14 day half-life). To generate recommendations, it retrieves the items whose feature vectors are most similar to this profile, measured by cosine or inner-product similarity.
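The profile construction described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the item vectors, the `STRENGTH` weights, the 10 day half-life, and all function names are hypothetical values chosen for the example.

```python
import math

# Hypothetical 3-dimensional item feature vectors (e.g. action, comedy, drama).
ITEM_VECTORS = {
    "action_a": [1.0, 0.0, 0.0],
    "action_b": [0.9, 0.1, 0.0],
    "comedy_a": [0.0, 1.0, 0.0],
    "comedy_b": [0.1, 0.9, 0.0],
    "drama_a": [0.0, 0.0, 1.0],
}

# Assumed interaction-strength weights: a purchase counts more than a view.
STRENGTH = {"view": 1.0, "purchase": 3.0}


def decay_weight(age_days, half_life_days=10.0):
    """Exponential decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def build_profile(interactions, item_vectors, half_life_days=10.0):
    """Weighted sum of item vectors; weight = strength * recency decay."""
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item_id, kind, age_days in interactions:
        w = STRENGTH[kind] * decay_weight(age_days, half_life_days)
        for i, x in enumerate(item_vectors[item_id]):
            profile[i] += w * x
    return profile


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recommend(profile, item_vectors, seen, k=3):
    """Rank unseen items by cosine similarity to the user profile."""
    scored = [
        (cosine(profile, vec), item_id)
        for item_id, vec in item_vectors.items()
        if item_id not in seen
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# (item_id, interaction kind, age in days)
interactions = [("action_a", "purchase", 2.0), ("comedy_a", "view", 20.0)]
profile = build_profile(interactions, ITEM_VECTORS)
recommendations = recommend(profile, ITEM_VECTORS, seen={"action_a", "comedy_a"})
print(recommendations)
```

Because the recent purchase of an action title outweighs the older comedy view, the unseen action item ranks first. At catalog scale the brute-force scan in `recommend` would be replaced by an index lookup.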
In production, this operates as a two-stage pipeline. First, an Approximate Nearest Neighbor (ANN) search retrieves the top 500 to 5,000 candidates from millions of items in 5 to 30 milliseconds at the 95th percentile (P95). Then a re-ranker scores a few hundred of those candidates using richer features and business constraints, taking another 50 to 150 milliseconds. Netflix uses this approach to recommend new titles before sufficient viewing data exists, reportedly improving cold-start ramp-up from days to hours.
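The two-stage shape can be illustrated with a small sketch. Note the assumptions: stage 1 here is an exact top-n scan using `heapq.nlargest`, standing in for a real ANN index, and the catalog fields (`available`, `is_new`), the `freshness_boost` term, and the function names are hypothetical examples of "richer features and business constraints".

```python
import heapq

# Hypothetical catalog: each item has a content vector plus business fields.
CATALOG = [
    {"id": "a", "vec": [0.9, 0.1], "available": True, "is_new": 0},
    {"id": "b", "vec": [0.8, 0.2], "available": True, "is_new": 1},
    {"id": "c", "vec": [1.0, 0.0], "available": False, "is_new": 0},
    {"id": "d", "vec": [0.1, 0.9], "available": True, "is_new": 1},
]


def dot(a, b):
    """Inner-product similarity."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(profile, catalog, n_candidates):
    """Stage 1: cheap similarity-only retrieval.

    Exact scan used as a stand-in for an ANN index (an assumption made
    so the example stays self-contained).
    """
    return heapq.nlargest(
        n_candidates, catalog, key=lambda item: dot(profile, item["vec"])
    )


def rerank(profile, candidates, k, freshness_boost=0.2):
    """Stage 2: richer scoring plus a business constraint.

    Unavailable items are dropped; new items get an assumed additive boost.
    """
    scored = []
    for item in candidates:
        if not item["available"]:
            continue  # business constraint: never show unavailable items
        score = dot(profile, item["vec"]) + freshness_boost * item["is_new"]
        scored.append((score, item["id"]))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


user_profile = [1.0, 0.0]
candidates = stage_one = retrieve(user_profile, CATALOG, n_candidates=3)
final = rerank(user_profile, candidates, k=2)
print([c["id"] for c in candidates], final)
```

The split keeps the expensive logic off the full catalog: stage 1 only needs a vector comparison, while stage 2 can afford per-item rules because it sees a short candidate list.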
The key advantage is that new items can be recommended immediately, without waiting for user interactions. The limitation is overspecialization: without explicit diversification strategies, pure CBF tends to recommend near-duplicates and create filter bubbles, so novelty stays low.
💡 Key Takeaways
• CBF models user preference as a function of item attributes. The user profile is an aggregate of the feature vectors of items the user engaged with, weighted by interaction strength and by recency, typically with an exponential decay half-life of 7 to 14 days.
• Production systems use a two-stage architecture: ANN retrieval pulls 500 to 5,000 candidates in 5 to 30ms P95, followed by re-ranking of 200 to 1,000 candidates in 50 to 150ms P95, keeping total latency under 200ms at P95 to P99.
• CBF solves cold start for new items immediately by using content features before any user interactions exist. Netflix reports that ramp-up for new titles improved from days to hours.
• The main limitation is overspecialization and filter bubbles: CBF tends to recommend near-duplicates with low novelty, so it requires explicit diversification strategies such as maximal marginal relevance (MMR) in re-ranking.
• Feature quality is critical: CBF requires rich, accurate metadata across modalities (text, image, audio, structured attributes), and sparse or noisy content causes significant quality degradation.
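As an illustration of the diversification step named in the takeaways, here is a minimal sketch of greedy maximal marginal relevance. The vectors, item names, and the `lam` trade-off value are assumptions chosen for the example; each pick maximizes `lam * relevance - (1 - lam) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query, candidates, k, lam=0.5):
    """Greedy MMR selection over a dict of {item_id: vector}.

    lam=1.0 is pure relevance ranking; lower values penalize items that
    are similar to ones already selected.
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(item_id):
            relevance = cosine(query, remaining[item_id])
            redundancy = max(
                (cosine(remaining[item_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical user profile and candidates; the two thrillers are near-duplicates.
user_profile = [1.0, 0.2]
candidate_vectors = {
    "thriller_1": [1.0, 0.0],
    "thriller_2": [0.99, 0.05],
    "documentary": [0.5, 0.8],
}

diverse = mmr(user_profile, candidate_vectors, k=2, lam=0.5)
relevance_only = mmr(user_profile, candidate_vectors, k=2, lam=1.0)
print(diverse, relevance_only)
```

With `lam=1.0` both thrillers are returned, which is the near-duplicate behavior described above. With `lam=0.5` the second slot goes to the less similar documentary instead.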
📌 Examples
Spotify uses audio content embeddings (timbre, rhythm, spectral features) plus text metadata to recommend new tracks that have sparse interaction data, retrieving the most similar tracks in single-digit milliseconds from tens of millions of items per shard across a corpus of 100M+ tracks.
Netflix extracts multi-modal content embeddings from synopses, artwork, and cast and crew graphs to recommend new titles within hours of launch, serving 250M+ members with sub-200ms P95 end-to-end home page latency.