
Online and Offline Feature Computation Architecture

The Two-Pipeline Architecture

Temporal features require different computation strategies based on freshness requirements. Offline pipelines (batch) compute features over large windows (30-day averages, historical aggregations) on a schedule. Online pipelines (streaming) compute features in real-time from event streams. The feature store serves both, providing unified access regardless of computation path.

Architecture Pattern: Batch pipeline computes historical baselines nightly. Streaming pipeline updates short-window aggregations (hourly counts) in real-time. Serving layer joins both at inference time.
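The serving-layer join above can be sketched minimally. This assumes simple dict-backed stores standing in for real feature-store lookups (in production, typically a low-latency online store such as Redis plus an offline warehouse); the function and key names are illustrative:

```python
def get_feature_vector(entity_id, offline_store, online_store):
    """Merge batch baselines and streaming aggregates into one feature vector.

    Online values take precedence on key collisions, since they are fresher.
    """
    features = {}
    features.update(offline_store.get(entity_id, {}))  # nightly baselines
    features.update(online_store.get(entity_id, {}))   # real-time aggregates
    return features


offline = {"user_1": {"avg_amount_30d": 120.0}}
online = {"user_1": {"txn_count_1h": 3}}
vector = get_feature_vector("user_1", offline, online)
```

The model sees one flat feature vector at inference time and does not need to know which pipeline produced each value.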

Offline Pipeline

Batch jobs run on data warehouses or distributed processing engines (Spark, BigQuery), scanning the full historical dataset. They compute 30-day averages, 90-day maximums, and lifetime statistics, and write output to the feature store. Jobs run daily or hourly depending on freshness needs. Advantages: full data access, complex computations, no latency constraints. Disadvantage: features are stale until the next batch run.
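A minimal sketch of the batch computation using pandas in place of Spark/BigQuery; the transaction schema and feature names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical transaction table; column names are illustrative.
txns = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-10"]),
    "amount": [100.0, 200.0, 50.0],
})

# Restrict to the trailing 30-day window relative to the batch run time.
as_of = pd.Timestamp("2024-01-31")
window = txns[txns["ts"] >= as_of - pd.Timedelta(days=30)]

# One row per user: the baselines that would be written to the feature store.
baselines = window.groupby("user_id")["amount"].agg(
    avg_amount_30d="mean", max_amount_30d="max"
).reset_index()
```

In a real pipeline the same aggregation would be expressed as a Spark job or warehouse SQL and scheduled (e.g. nightly) by an orchestrator.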

Online Pipeline

Streaming jobs process events in real-time (Kafka, Flink, Spark Streaming). They maintain sliding-window state: increment counts on new events, decrement when events age out of the window. Advantages: features reflect activity from seconds ago. Disadvantages: limited to incremental computations, state-management complexity, higher operational cost.
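The sliding-window state described above can be sketched with a deque of event timestamps; this is a single-process stand-in for what Flink or Spark Streaming would keep as managed, fault-tolerant state:

```python
from collections import deque


class SlidingWindowCounter:
    """Incremental event count over a fixed trailing window (e.g. last hour)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, ts):
        """Increment: record a new event timestamp."""
        self.events.append(ts)

    def count(self, now):
        """Decrement: age out events older than the window, then report."""
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)


counter = SlidingWindowCounter(window_seconds=3600)
counter.add(0)      # falls out of the window by t=3600
counter.add(1800)
counter.add(3500)
```

Note this keeps one entry per event; at high volume, streaming systems usually approximate with bucketed counts (e.g. per-minute tumbling buckets) to bound state size.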

Hybrid Strategy: Use batch for stable baseline features (30-day average). Use streaming for recent activity (1-hour count). At inference, fetch both and compute ratios (current_hour / baseline_30d) for velocity detection.
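The inference-time ratio can be sketched as a small pure function; the feature names and the epsilon guard against empty baselines are assumptions:

```python
def velocity_score(current_hour_count, baseline_hourly_avg, eps=1e-6):
    """Ratio of recent activity to the historical baseline.

    Values near 1.0 mean normal activity; large values indicate a spike
    relative to the user's own 30-day behavior. eps avoids division by
    zero for entities with no historical baseline.
    """
    return current_hour_count / (baseline_hourly_avg + eps)
```

At serving time, `current_hour_count` comes from the streaming store and `baseline_hourly_avg` from the batch store; the ratio itself is cheap enough to compute per request.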

Consistency Challenges

Ensure batch and streaming compute features identically. Code divergence causes training-serving skew. Solutions: shared feature definitions, unit tests validating equivalence, periodic reconciliation comparing batch-computed and streaming-computed values for the same time windows.
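A minimal sketch of the shared-definition and reconciliation idea, with all names hypothetical: the aggregation logic lives in one function, the streaming path maintains it incrementally, and a reconciliation check replays the same window through both paths:

```python
def avg_amount(amounts):
    """Shared feature definition: the single source of truth for the mean."""
    return sum(amounts) / len(amounts) if amounts else 0.0


def batch_compute(rows):
    """Batch path: full recompute over all rows in the window."""
    return avg_amount([r["amount"] for r in rows])


class StreamingAvg:
    """Streaming path: incremental mean that must match batch_compute."""

    def __init__(self):
        self.total = 0.0
        self.n = 0

    def update(self, amount):
        self.total += amount
        self.n += 1

    def value(self):
        return self.total / self.n if self.n else 0.0


def reconcile(rows, tol=1e-9):
    """Replay one window through both paths and compare the results."""
    streaming = StreamingAvg()
    for row in rows:
        streaming.update(row["amount"])
    return abs(batch_compute(rows) - streaming.value()) <= tol
```

In practice this comparison runs periodically over sampled windows; a failed reconciliation signals implementation drift between the two pipelines before it shows up as training-serving skew.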

💡 Key Takeaways
Batch pipelines compute historical baselines (30-day averages); streaming pipelines update short-window aggregations in real-time
Hybrid strategy: batch for stable baselines, streaming for recent activity, compute ratios at inference time
Ensure batch and streaming compute features identically—code divergence causes training-serving skew
📌 Interview Tips
1. Use Spark/BigQuery for batch (full data, complex computations), Kafka/Flink for streaming (incremental, real-time)
2. Periodic reconciliation compares batch-computed and streaming-computed values to detect implementation drift