
Training Pipeline and Offline Batch Feature Computation

Training Data for Personalization Models

Personalization models train on logged interactions: (user, query, item, context, outcome) tuples. The challenge: you need both long-term and short-term features reconstructed at training time. This means joining user profile snapshots with session logs at the exact timestamp of each interaction. Using current user profiles for historical interactions leaks future information.
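A minimal sketch of one logged tuple and the as-of join it implies is below; the field names and the `profile_as_of` / `session_before` lookups are illustrative stand-ins for the feature store and session log, not names from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """One logged (user, query, item, context, outcome) tuple; field names are illustrative."""
    user_id: str
    session_id: str
    query: str
    item_id: str
    ts: int                                        # interaction timestamp (epoch seconds)
    context: dict = field(default_factory=dict)    # device, locale, result position, ...
    outcome: int = 0                               # 1 = click/conversion, 0 = impression only

def build_training_row(ix: Interaction, profile_as_of, session_before) -> dict:
    """Join features as they existed at ix.ts, never as they exist today.

    `profile_as_of(user_id, ts)` returns the latest profile snapshot at or before ts;
    `session_before(session_id, ts)` returns features from clicks strictly before ts.
    Both are hypothetical callables standing in for the real stores.
    """
    long_term = profile_as_of(ix.user_id, ix.ts)
    short_term = session_before(ix.session_id, ix.ts)
    return {**long_term, **short_term,
            "query": ix.query, "item_id": ix.item_id, "label": ix.outcome}
```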

Offline Feature Computation

Long-term user features (profile embeddings, category preferences, price sensitivity) are computed in daily batch jobs. Process: aggregate last 90 days of user interactions, compute feature values, store in offline feature store. These features change slowly, so daily refresh is sufficient. For training, snapshot these features and join by timestamp: training example from March 5th uses the user profile as it existed on March 5th, not today's profile.
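When the daily snapshots and interaction logs live in dataframes, the point-in-time join can be expressed with pandas' `merge_asof`: each interaction picks up the most recent profile snapshot at or before its own timestamp. The tables and column names below are illustrative.

```python
import pandas as pd

# Daily profile snapshots, one row per (user_id, snapshot date); columns are illustrative.
profiles = pd.DataFrame({
    "user_id":           ["u1", "u1", "u2"],
    "snapshot_ts":       pd.to_datetime(["2024-03-04", "2024-03-05", "2024-03-05"]),
    "price_sensitivity": [0.2, 0.3, 0.8],
})

# Logged interactions with their exact timestamps.
interactions = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "ts":      pd.to_datetime(["2024-03-05 10:15", "2024-03-05 09:00"]),
    "label":   [1, 0],
})

# As-of join: each interaction gets the latest snapshot at or before its timestamp,
# so a March 5th example uses the March 5th profile, never a later one.
train = pd.merge_asof(
    interactions.sort_values("ts"),
    profiles.sort_values("snapshot_ts"),
    left_on="ts", right_on="snapshot_ts",
    by="user_id", direction="backward",
)
print(train[["user_id", "ts", "price_sensitivity", "label"]])
```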

Reconstructing Session Features for Training

Session features (clicks before this search, session embedding) must be reconstructed from logs. For a search at timestamp T, find all clicks by that user in the same session before T. Compute session embedding from those clicks. This is expensive: for each training example, replay the session up to that point. Optimization: pre-compute session state at regular checkpoints (every 5 minutes), then compute delta from checkpoint to example time.
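A sketch of the checkpointed reconstruction is below. It models the session embedding as the mean of clicked-item embeddings, which is a simplifying assumption for illustration; the checkpoint stores running sums so only the clicks between the last checkpoint and the example time are replayed.

```python
import numpy as np

DIM = 4                     # toy embedding dimension for the sketch
CHECKPOINT_INTERVAL = 300   # seconds, i.e. the 5-minute checkpoints described above

def reconstruct_session_embedding(t, clicks, checkpoints):
    """Reconstruct the session embedding for a search at time t.

    clicks:      [(ts, item_embedding), ...] for this session, sorted by ts.
    checkpoints: {checkpoint_ts: (embedding_sum, click_count)}, pre-computed every
                 CHECKPOINT_INTERVAL seconds over all clicks strictly before checkpoint_ts.
    """
    cp_ts = max((ts for ts in checkpoints if ts <= t), default=None)
    if cp_ts is None:
        emb_sum, count, start = np.zeros(DIM), 0, float("-inf")
    else:
        emb_sum, count = checkpoints[cp_ts]
        emb_sum, start = emb_sum.copy(), cp_ts
    # Replay only the delta: clicks between the checkpoint and the example time.
    for ts, emb in clicks:
        if start <= ts < t:
            emb_sum = emb_sum + emb
            count += 1
    return emb_sum / count if count else np.zeros(DIM)
```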

Point-in-Time Correctness

Every feature must reflect its value at interaction time, not current time. User profile: use March 5th snapshot for March 5th examples. Session features: only include clicks before the search, not after. Item features: use item embedding as it existed then (items change: titles update, prices change). This prevents the model from learning to use future information that won't be available at serving time.
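One cheap safeguard is an assertion pass over assembled training rows that compares each feature's source timestamp to the interaction timestamp; the row fields here (`profile_snapshot_ts`, `last_session_click_ts`, `item_feature_ts`) are hypothetical names for illustration.

```python
def point_in_time_violations(row: dict) -> list[str]:
    """Return leakage violations for one assembled training row.

    Assumes the row carries the timestamps of the sources it was built from,
    under the hypothetical field names used below.
    """
    t = row["interaction_ts"]
    violations = []
    if row["profile_snapshot_ts"] > t:
        violations.append("user profile snapshot is from after the interaction")
    if row.get("last_session_click_ts") is not None and row["last_session_click_ts"] >= t:
        violations.append("session features include clicks at/after the search")
    if row["item_feature_ts"] > t:
        violations.append("item features reflect a later version of the item")
    return violations
```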

Serving-Training Parity

The same feature computation code must run in both training and serving. If offline batch uses Python and online serving uses Java, subtle differences cause training-serving skew. Solution: define features in a shared language or format, generate code for both environments, run automated tests comparing offline and online feature values for sampled examples. A 2% feature drift can cause 5-10% model quality degradation.
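A parity test in this spirit samples recent interactions, computes features through both paths, and reports per-feature drift; `compute_offline` and `compute_online` below are stand-ins for the real batch and serving pipelines, each assumed to return a `{feature_name: float}` dict.

```python
import math
import random

def parity_report(examples, compute_offline, compute_online,
                  rel_tol=1e-6, sample_size=1000):
    """Compare offline and online feature values on a sample of examples.

    Returns per-feature drift rate: the fraction of sampled examples where the
    two pipelines disagree beyond rel_tol.
    """
    sample = random.sample(examples, min(sample_size, len(examples)))
    mismatches: dict[str, int] = {}
    for ex in sample:
        off, on = compute_offline(ex), compute_online(ex)
        for name in off.keys() & on.keys():
            if not math.isclose(off[name], on[name], rel_tol=rel_tol):
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: count / len(sample) for name, count in mismatches.items()}

# Given the figures cited above, a drift rate near 2% on any feature is a
# reasonable threshold for blocking a model push.
```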

💡 Key Takeaways
- Training data joins user profiles, session logs, and item features at the exact interaction timestamp to avoid future leakage
- Long-term features computed in daily batch from 90-day history; stored with timestamps for point-in-time joins
- Session features reconstructed by replaying clicks before each training example; checkpoint optimization reduces cost
- Point-in-time correctness: every feature must reflect its value at interaction time, not current time
- Serving-training parity: same code path for both; 2% feature drift causes 5-10% quality degradation
📌 Interview Tips
1. Explain session reconstruction: for a search at time T, replay all clicks in that session before T
2. Mention the checkpoint optimization: pre-compute session state every 5 minutes, then compute the delta to the example time
3. Emphasize point-in-time: a March 5th example uses the March 5th user profile snapshot, not today's profile