Training Pipeline and Offline Batch Feature Computation
Training Data for Personalization Models
Personalization models train on logged interactions: (user, query, item, context, outcome) tuples. The challenge: at training time, both long-term and short-term features must be reconstructed as they were at the moment of each interaction. This means joining user profile snapshots with session logs at the exact timestamp of each interaction. Using current user profiles for historical interactions leaks future information.
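A minimal sketch of that timestamped join, assuming hypothetical data shapes (per-user profile snapshots keyed by day, and interaction logs with integer-day timestamps); a real pipeline would do this with an as-of join in a batch engine:

```python
from bisect import bisect_right

# Hypothetical data: daily profile snapshots per user as (day, features),
# sorted by day, plus logged interactions with timestamps.
profile_snapshots = {
    "u1": [(1, {"avg_price": 20.0}), (5, {"avg_price": 25.0})],
}
interactions = [
    {"user": "u1", "query": "shoes", "item": "i9", "ts": 5, "outcome": 1},
    {"user": "u1", "query": "boots", "item": "i3", "ts": 3, "outcome": 0},
]

def profile_as_of(user, ts):
    """Latest profile snapshot taken at or before ts (point-in-time join)."""
    snaps = profile_snapshots.get(user, [])
    days = [d for d, _ in snaps]
    i = bisect_right(days, ts)
    return snaps[i - 1][1] if i > 0 else {}

# Each training row gets the profile as it existed at interaction time,
# never the current profile.
training_rows = [
    {**ix, "profile": profile_as_of(ix["user"], ix["ts"])} for ix in interactions
]
```

The interaction at ts=3 correctly picks up the day-1 snapshot even though a newer snapshot exists.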
Offline Feature Computation
Long-term user features (profile embeddings, category preferences, price sensitivity) are computed in daily batch jobs. Process: aggregate the last 90 days of user interactions, compute feature values, store in the offline feature store. These features change slowly, so daily refresh is sufficient. For training, snapshot these features and join by timestamp: a training example from March 5th uses the user profile as it existed on March 5th, not today's profile.
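The daily batch job can be sketched as follows; the log schema and the specific features (`avg_price`, `top_category`) are illustrative assumptions, not the actual production feature set:

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical interaction log rows: (user, day, category, price).
log = [
    ("u1", date(2024, 3, 1), "shoes", 40.0),
    ("u1", date(2024, 3, 4), "shoes", 60.0),
    ("u1", date(2023, 11, 1), "hats", 10.0),  # outside the 90-day window
]

def compute_long_term_features(log, as_of, window_days=90):
    """Daily batch job: aggregate each user's last `window_days` of
    interactions into slow-moving profile features, snapshotted at `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    per_user = defaultdict(list)
    for user, day, category, price in log:
        if cutoff <= day <= as_of:
            per_user[user].append((category, price))
    snapshot = {}
    for user, rows in per_user.items():
        prices = [p for _, p in rows]
        cats = defaultdict(int)
        for c, _ in rows:
            cats[c] += 1
        snapshot[user] = {
            "avg_price": sum(prices) / len(prices),
            "top_category": max(cats, key=cats.get),
            "as_of": as_of,  # stamp the snapshot so training can join on it
        }
    return snapshot

snap = compute_long_term_features(log, as_of=date(2024, 3, 5))
```

Tagging each snapshot with its `as_of` date is what makes the timestamped training join possible later.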
Reconstructing Session Features for Training
Session features (clicks before this search, session embedding) must be reconstructed from logs. For a search at timestamp T, find all clicks by that user in the same session before T. Compute session embedding from those clicks. This is expensive: for each training example, replay the session up to that point. Optimization: pre-compute session state at regular checkpoints (every 5 minutes), then compute delta from checkpoint to example time.
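A sketch of the checkpoint optimization, under simplifying assumptions (2-dimensional click embeddings, a session embedding defined as the mean of click embeddings, timestamps in seconds):

```python
# Hypothetical session log: clicks as (ts_seconds, item_embedding).
clicks = [(10, [1.0, 0.0]), (200, [0.0, 1.0]), (400, [1.0, 1.0])]

CHECKPOINT_EVERY = 300  # checkpoint session state every 5 minutes

def checkpoint_states(clicks):
    """Pre-compute the running (sum, count) of click embeddings at each
    checkpoint boundary, so replay only scans the delta past a checkpoint."""
    states, total, n = {}, [0.0, 0.0], 0
    boundary = CHECKPOINT_EVERY
    for ts, emb in sorted(clicks):
        while ts >= boundary:
            states[boundary] = (list(total), n)
            boundary += CHECKPOINT_EVERY
        total = [a + b for a, b in zip(total, emb)]
        n += 1
    return states

def session_embedding_at(clicks, states, t):
    """Session embedding just before time t: start from the last checkpoint
    at or before t, then replay only the clicks in [checkpoint, t)."""
    cp = (t // CHECKPOINT_EVERY) * CHECKPOINT_EVERY
    total, n = states.get(cp, ([0.0, 0.0], 0))
    total = list(total)
    for ts, emb in clicks:
        if cp <= ts < t:
            total = [a + b for a, b in zip(total, emb)]
            n += 1
    return [x / n for x in total] if n else None
```

Note the strict `ts < t` bound: a click at exactly the search timestamp is excluded, matching what the serving path could have seen.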
Point-in-Time Correctness
Every feature must reflect its value at interaction time, not current time. User profile: use March 5th snapshot for March 5th examples. Session features: only include clicks before the search, not after. Item features: use item embedding as it existed then (items change: titles update, prices change). This prevents the model from learning to use future information that won't be available at serving time.
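For the item-feature case, one way to get point-in-time correctness is to keep a version history per item and look up the version in effect at the interaction timestamp. A minimal sketch, with a hypothetical history layout:

```python
from bisect import bisect_right

# Hypothetical versioned item features: each item keeps a list of
# (effective_ts, features) entries, appended whenever the item changes
# (title update, price change, re-embedded).
item_history = {
    "i9": [(0, {"title": "Running Shoe", "price": 80.0}),
           (100, {"title": "Running Shoe v2", "price": 70.0})],
}

def item_features_as_of(item, ts):
    """Return the feature version in effect at ts; never a later version."""
    versions = item_history.get(item, [])
    ts_list = [t for t, _ in versions]
    i = bisect_right(ts_list, ts)
    return versions[i - 1][1] if i > 0 else None
```

A training example logged at ts=50 sees the original title and price even though the item has since been updated.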
Serving-Training Parity
The same feature computation code must run in both training and serving. If offline batch uses Python and online serving uses Java, subtle differences cause training-serving skew. Solution: define features in a shared language or format, generate code for both environments, run automated tests comparing offline and online feature values for sampled examples. A 2% feature drift can cause 5-10% model quality degradation.
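The automated parity test can be sketched like this, using a toy feature (average clicked price) with two deliberately different implementations standing in for the offline and online code paths; the function names and tolerance are assumptions for illustration:

```python
import math

def offline_avg_click_price(prices):
    # Offline batch path: straightforward mean over the full window.
    return sum(prices) / len(prices) if prices else 0.0

def online_avg_click_price(prices):
    # Online serving path: incremental (streaming) mean. Different code,
    # same feature definition; results must agree within tolerance.
    mean = 0.0
    for i, p in enumerate(prices, start=1):
        mean += (p - mean) / i
    return mean

def parity_check(samples, rel_tol=1e-6):
    """Compare offline and online feature values on sampled inputs;
    return the inputs where the two implementations disagree."""
    mismatches = []
    for prices in samples:
        off = offline_avg_click_price(prices)
        on = online_avg_click_price(prices)
        if not math.isclose(off, on, rel_tol=rel_tol, abs_tol=1e-9):
            mismatches.append((prices, off, on))
    return mismatches
```

Running this on a daily sample of logged inputs and alerting on any non-empty mismatch list is a cheap guard against the training-serving skew described above.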