Online and Offline Feature Computation Architecture
The Two-Pipeline Architecture
Temporal features require different computation strategies depending on freshness requirements. Offline (batch) pipelines compute features over large windows (30-day averages, historical aggregations) on a schedule. Online (streaming) pipelines compute features in real time from event streams. The feature store serves both, providing unified access regardless of computation path.
Architecture Pattern: Batch pipeline computes historical baselines nightly. Streaming pipeline updates short-window aggregations (hourly counts) in real-time. Serving layer joins both at inference time.
Offline Pipeline
Batch jobs run on warehouse or compute engines (Spark, BigQuery) and process the full event history. They compute features such as 30-day averages, 90-day maximums, and lifetime statistics, writing the output to the feature store on a daily or hourly schedule depending on freshness needs. Advantages: full data access, arbitrarily complex computations, no latency constraints. Disadvantage: features are stale until the next batch run.
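A minimal sketch of the batch step, using an in-memory event list in place of a warehouse table (the function name, event shape, and dict-based "feature store" output are illustrative assumptions, not a specific framework's API):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def compute_batch_features(events, as_of, window_days=30):
    """Nightly batch-job sketch: average daily event count per user
    over a trailing window.

    `events` is an iterable of (user_id, timestamp) pairs; a real job
    would read these from a warehouse table via Spark or BigQuery SQL.
    """
    cutoff = as_of - timedelta(days=window_days)
    counts = defaultdict(int)
    for user_id, ts in events:
        if cutoff <= ts < as_of:
            counts[user_id] += 1
    # Each row here would be written to the offline feature store,
    # keyed by (user_id, as_of). Stale until the next run.
    return {user: n / window_days for user, n in counts.items()}
```

The same windowed aggregation is typically expressed as a single `GROUP BY` in SQL; the point is that the full window is recomputed from scratch on every run.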
Online Pipeline
Streaming jobs process events in real time (Kafka, Flink, Spark Streaming) and maintain sliding-window state: increment a count when a new event arrives, decrement it when an event ages out of the window. Advantages: features reflect activity from seconds ago. Disadvantages: limited to incrementally computable features, state-management complexity, higher operational cost.
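The increment/decrement pattern can be sketched with in-memory state (a real Flink or Spark Streaming job would keep this in checkpointed, fault-tolerant state; the class and method names here are assumptions for illustration):

```python
from collections import deque

class SlidingWindowCounter:
    """Streaming-state sketch: per-key event counts over a sliding window.

    add() increments on a new event; _expire() decrements as events
    age out, so reads always reflect only the last `window_seconds`.
    """
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = {}   # key -> deque of event timestamps in the window
        self.counts = {}   # key -> current in-window count

    def add(self, key, ts):
        self.events.setdefault(key, deque()).append(ts)
        self.counts[key] = self.counts.get(key, 0) + 1
        self._expire(key, ts)

    def get(self, key, now):
        self._expire(key, now)
        return self.counts.get(key, 0)

    def _expire(self, key, now):
        q = self.events.get(key)
        while q and q[0] <= now - self.window:
            q.popleft()            # event left the window: decrement
            self.counts[key] -= 1
```

Keeping every timestamp makes the decrement exact but costs memory proportional to window size; production systems often trade exactness for bounded state (e.g. fixed sub-buckets of the window).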
Hybrid Strategy: Use batch for stable baseline features (30-day average). Use streaming for recent activity (1-hour count). At inference, fetch both and compute ratios (current_hour / baseline_30d) for velocity detection.
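The inference-time join reduces to fetching one value from each store and dividing; a sketch, with the function name and the per-hour normalization as assumptions:

```python
def velocity_ratio(count_1h, avg_per_day_30d):
    """Hybrid-feature sketch: streaming 1-hour count vs. batch 30-day
    baseline. Returns how many times the user's normal hourly rate the
    last hour represents; large values flag velocity anomalies.
    """
    baseline_per_hour = avg_per_day_30d / 24.0
    if baseline_per_hour == 0.0:
        # No baseline: any activity at all is anomalous by this measure.
        return float("inf") if count_1h else 0.0
    return count_1h / baseline_per_hour
```

At serving time, `count_1h` would come from the online store (streaming pipeline) and `avg_per_day_30d` from the offline store (batch pipeline), fetched by the same entity key.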
Consistency Challenges
Batch and streaming paths must compute each feature identically; any code divergence causes training-serving skew. Mitigations: shared feature definitions used by both pipelines, unit tests validating equivalence, and periodic reconciliation that compares batch-computed and streaming-computed values for the same time windows.
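The reconciliation step can be sketched as a simple comparison over matching keys (function name and tolerance parameter are illustrative assumptions; in production the diverging keys would feed an alert rather than a return value):

```python
def reconcile(batch_values, stream_values, tolerance=1e-6):
    """Reconciliation sketch: compare batch- and streaming-computed
    feature values for the same keys and time window. Returns the set
    of keys whose values diverge beyond `tolerance` -- a nonempty
    result indicates training-serving skew to investigate.
    """
    keys = set(batch_values) | set(stream_values)
    return {
        k for k in keys
        if abs(batch_values.get(k, 0.0) - stream_values.get(k, 0.0)) > tolerance
    }
```

Running this on a sampled window (say, yesterday's hourly counts recomputed by the batch path) catches divergence that unit tests on the shared definition miss, such as late-arriving events handled differently by the two pipelines.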