
Training Pipeline and Offline Batch Feature Computation

While real-time systems serve personalized rankings in milliseconds, most of the heavy computation happens offline in batch pipelines that process billions of interactions daily. These pipelines train embedding models, compute long-term user and item features, generate training datasets with counterfactual labels, and validate models before deployment. The offline and online systems must stay synchronized to avoid training-serving skew, where feature definitions or distributions diverge and cause accuracy drops of 10 to 30 percent in production.

The pipeline starts with raw event logs, such as impressions, clicks, dwell times, and conversions, stored in distributed file systems like HDFS or object stores like S3. A daily or hourly batch job sessionizes these events with the same 30-minute inactivity gap used online, ensuring consistency. It computes long-term aggregates such as favorite categories over 60 days with exponential decay, average booking price, click-through rate by item, and item quality scores from conversion rates. These aggregates are written to feature tables partitioned by date, typically using formats like Parquet for efficient columnar access.

Embedding training runs weekly or biweekly on hundreds of millions to billions of click sequences. Airbnb uses skip-gram with negative sampling, training 32-dimensional embeddings over 800 million sessions. For sessions ending in bookings, the booked listing is included as a global context. They apply market-specific negative sampling to prevent cross-market leakage, since user preferences in New York differ from those in Tokyo. Training takes hours on GPU clusters, producing embeddings stored in a model registry and replicated to search servers for online similarity computation.

Training dataset generation joins event logs with feature tables to create labeled examples. Each example includes the query, the candidate item, the user and session features at impression time, and the label, such as click or booking. Position information is crucial for counterfactual correction: the pipeline applies inverse propensity weighting based on position and adds exploration examples where items were randomly injected, giving unbiased labels. Typical dataset sizes reach hundreds of millions to billions of examples per day at Google or Amazon scale.

Model training uses gradient boosted decision trees or neural networks, depending on latency and accuracy requirements. Airbnb's Experiences team used Gradient Boosted Decision Trees (GBDT) with 50 to 90 features, training daily on updated datasets. Training time is hours for GBDT and days for deep neural networks. Models are validated on held-out data with metrics like Normalized Discounted Cumulative Gain (NDCG), Area Under the Curve (AUC), and offline revenue estimates. Only models that beat the current production champion on both offline metrics and simulator replay are promoted to an A/B test.

Feature parity testing is critical. Before deployment, the pipeline compares offline-computed features to online feature values on sample queries, checking that distributions match within tolerance. Airbnb enforces that the mean and variance of key features differ by no more than 5 percent; mismatches trigger alerts and block deployment. Without this, subtle bugs like different timezone handling or aggregation logic cause training-serving skew that degrades accuracy by 20 percent or more in production. The sketches below illustrate several of these steps under assumed names and data shapes.
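As a concrete illustration of the sessionization and long-term aggregate step, here is a minimal sketch assuming pandas and a hypothetical events table with `user_id`, `category`, and `ts` columns; the half-life and partition date are placeholders, since the text only specifies a 60-day window with exponential decay.

```python
import numpy as np
import pandas as pd

WINDOW_DAYS = 60                      # 60-day lookback from the text
HALF_LIFE_DAYS = 30                   # assumed half-life; the text only says "exponential decay"
AS_OF = pd.Timestamp("2024-01-01")    # batch partition date (placeholder)

def sessionize(events: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    """Assign per-user session ids using the same 30-minute inactivity gap as the online system."""
    events = events.sort_values(["user_id", "ts"]).copy()
    new_session = events.groupby("user_id")["ts"].diff() > pd.Timedelta(minutes=gap_minutes)
    events["session_id"] = new_session.astype(int).groupby(events["user_id"]).cumsum()
    return events

def decayed_category_counts(events: pd.DataFrame) -> pd.DataFrame:
    """Per-user, per-category click counts with exponential time decay over the window."""
    recent = events[events["ts"] >= AS_OF - pd.Timedelta(days=WINDOW_DAYS)].copy()
    age_days = (AS_OF - recent["ts"]).dt.total_seconds() / 86400.0
    recent["weight"] = np.power(0.5, age_days / HALF_LIFE_DAYS)
    return (recent.groupby(["user_id", "category"])["weight"]
                  .sum()
                  .rename("decayed_clicks")
                  .reset_index())

# The aggregate table would then be written out partitioned by date, e.g.:
# decayed_category_counts(events).to_parquet(f"features/dt={AS_OF.date()}/user_category.parquet")
```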
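For the embedding step, the sketch below only shows how skip-gram training triples might be assembled with the two tricks described above: the booked listing added as a global context, and negatives drawn from the same market. The `Session` structure and `listings_by_market` index are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Session:
    listing_ids: List[int]                 # clicked listings, in order
    market: str                            # e.g. "paris"
    booked_listing: Optional[int] = None   # set when the session ended in a booking

def skipgram_triples(session: Session,
                     listings_by_market: Dict[str, List[int]],
                     window: int = 2,
                     num_negatives: int = 5):
    """Yield (center, context, negatives) triples for skip-gram with negative sampling.

    - The booked listing (if any) is added as a global context for every center,
      so the final booking influences the whole session.
    - Negatives are sampled from the same market, avoiding trivially easy
      cross-market distinctions.
    """
    market_pool = listings_by_market[session.market]
    for i, center in enumerate(session.listing_ids):
        contexts = (session.listing_ids[max(0, i - window):i] +
                    session.listing_ids[i + 1:i + 1 + window])
        if session.booked_listing is not None:
            contexts.append(session.booked_listing)      # global context
        for ctx in contexts:
            sampled = random.sample(market_pool, k=min(3 * num_negatives, len(market_pool)))
            negatives = [l for l in sampled if l not in (center, ctx)][:num_negatives]
            yield center, ctx, negatives

# These triples would feed a standard skip-gram objective producing the 32-dimensional
# listing embeddings, e.g. maximizing
#   log sigma(v_center . v_ctx) + sum over negatives of log sigma(-v_center . v_neg)
```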
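Inverse propensity weighting of training examples can be sketched as follows; the propensity values are placeholders that, in practice, would be estimated from the randomized exploration traffic mentioned above.

```python
# Assumed position-click propensities, in practice estimated from exploration traffic
# where items are injected at random positions. The numbers here are placeholders.
POSITION_PROPENSITY = {1: 0.68, 2: 0.41, 3: 0.29, 4: 0.22, 5: 0.18}
DEFAULT_PROPENSITY = 0.10
MAX_WEIGHT = 20.0   # clip so rare deep positions do not dominate the loss

def example_weight(position: int, clicked: bool) -> float:
    """Inverse propensity weight for one impression in the training dataset."""
    if not clicked:
        return 1.0   # one common convention: unclicked impressions keep unit weight
    propensity = POSITION_PROPENSITY.get(position, DEFAULT_PROPENSITY)
    return min(1.0 / propensity, MAX_WEIGHT)

# The weights are passed to the learner as per-example sample weights during training.
```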
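A hedged sketch of the ranker training and offline validation step, assuming LightGBM for the GBDT and scikit-learn's `ndcg_score` for evaluation (the source only states that GBDT models are validated on NDCG and AUC):

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import ndcg_score

def train_and_validate(X_tr, y_tr, qid_tr, X_va, y_va, qid_va):
    """Train a lambdarank GBDT and report mean NDCG@10 on held-out queries.

    Assumes rows are sorted by query id so that per-query group sizes line up
    with row order, as LightGBM's ranking API expects.
    """
    group_tr = np.unique(qid_tr, return_counts=True)[1]
    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=300, learning_rate=0.05)
    ranker.fit(X_tr, y_tr, group=group_tr)

    scores = ranker.predict(X_va)
    per_query_ndcg = []
    for q in np.unique(qid_va):
        mask = qid_va == q
        if mask.sum() < 2:
            continue   # ndcg_score needs at least two candidates per query
        per_query_ndcg.append(ndcg_score([y_va[mask]], [scores[mask]], k=10))
    return ranker, float(np.mean(per_query_ndcg))

# A candidate model would only be promoted to an A/B test if this offline NDCG
# (plus AUC and simulator replay) beats the current production champion.
```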
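Finally, a feature parity check like the one described above can be as simple as comparing per-feature means and variances between the offline tables and the online feature store; the 5 percent tolerance comes from the text, everything else is illustrative.

```python
import numpy as np

TOLERANCE = 0.05   # the 5 percent threshold from the text

def parity_ok(offline_values, online_values, tolerance: float = TOLERANCE) -> bool:
    """Return True if the mean and variance of a feature agree between the offline
    pipeline and the online feature store, within a relative tolerance."""
    offline = np.asarray(offline_values, dtype=float)
    online = np.asarray(online_values, dtype=float)
    mean_gap = abs(offline.mean() - online.mean()) / (abs(online.mean()) + 1e-9)
    var_gap = abs(offline.var() - online.var()) / (online.var() + 1e-9)
    return mean_gap <= tolerance and var_gap <= tolerance

# Run in CI over sampled users/queries; a failure blocks the deployment:
# assert parity_ok(offline_avg_price, online_avg_price), "feature parity check failed"
```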
The tradeoff is freshness versus cost. Daily batch retraining captures trends within 24 hours but requires large compute clusters and engineering effort to manage dependencies and data quality. Airbnb saw that moving from weekly to daily personalization updates improved bookings by 5 percent in Experiences search. More aggressive hourly or real-time model updates using online learning can reduce update lag to minutes, but they risk model instability from noisy data and require sophisticated monitoring to detect divergence.
💡 Key Takeaways
Daily batch pipelines sessionize billions of events with a 30-minute gap, compute long-term aggregates over 60 days with decay, and write to Parquet feature tables partitioned by date
Embedding training runs weekly on 800 million sessions using skip-gram with market-specific negative sampling, taking hours on GPU clusters and producing 32-dimensional vectors stored in a model registry
Training datasets join event logs with feature tables at impression time, apply inverse propensity weighting by position, and include exploration examples for unbiased labels at billions of examples per day
Gradient Boosted Decision Tree (GBDT) rankers with 50 to 90 features train daily in hours, validated on NDCG and AUC, with simulator replay before A/B test to prevent regressions
Feature parity testing compares offline and online features on sample queries, requiring mean and variance to match within 5 percent to prevent training-serving skew that causes 20 percent accuracy drops
Daily retraining improves bookings by 5 percent over weekly updates but requires large compute clusters, while hourly online learning reduces lag to minutes at risk of instability
📌 Examples
Airbnb retrains embeddings weekly on 800 million sessions across 4.5 million listings, using skip-gram with the booked listing as global context and market-specific negatives to prevent leakage
Google trains rankers on billions of query impression pairs daily, applying propensity weighting and validating on NDCG before gradual rollout to 1 percent of traffic for A/B testing
Amazon runs feature parity tests in continuous integration, comparing offline computed average purchase price to online values on 10,000 sample users and blocking deploy if variance exceeds 5 percent