Feature Transformation Pipelines: Streaming vs Batch Architecture
Feature transformation pipelines compute model features from raw events with strict requirements for freshness, reproducibility, and consistency. Two execution models dominate production: continuous streaming (Flink-style) and micro-batch or batch (Spark-style). Both express transformations as Directed Acyclic Graphs (DAGs) of operators such as map, filter, join, and aggregate, but they differ fundamentally in time semantics and latency characteristics.
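As a sketch of what such a DAG looks like in practice, the PySpark job below chains filter, join, and aggregate operators into simple purchase-count features. The storage paths, table layout, and column names (user_id, event_type, amount) are illustrative assumptions, not references to any specific pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-dag-sketch").getOrCreate()

events = spark.read.parquet("s3://bucket/raw_events/")  # raw event log (assumed path)
users = spark.read.parquet("s3://bucket/user_dim/")     # user dimension table (assumed path)

features = (
    events
    .filter(F.col("event_type") == "purchase")          # filter operator
    .join(users, on="user_id", how="inner")             # join operator
    .groupBy("user_id")                                  # aggregate operator
    .agg(
        F.count("*").alias("purchase_count"),
        F.avg("amount").alias("avg_purchase_amount"),
    )
)

features.write.mode("overwrite").parquet("s3://bucket/features/purchase_stats/")
```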
Streaming pipelines run continuously with event-time semantics, watermarks, and stateful functions. They achieve sub-500ms end-to-end latency with exactly-once state updates via checkpointing. Stateful operators maintain keyed state at hundreds of gigabytes to multi-terabyte scale, enabling features such as rolling counts over the last 5 minutes per user without external database round trips.
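A minimal PyFlink Table API sketch of this style of pipeline is shown below: a Kafka-backed event table with a declared watermark feeds a sliding 5-minute count per user. The topic, broker address, schema, and window sizes are assumed for illustration; a production job would additionally configure state backends and checkpointing.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Continuous streaming job expressed through the Table API (Flink SQL).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table with an event-time attribute and watermark; connector options are illustrative.
t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        event_type STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'raw_events',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Rolling 5-minute count per user, sliding every minute; Flink keeps the window
# state keyed by user_id, so no external database lookup is needed.
rolling_counts = t_env.sql_query("""
    SELECT user_id,
           HOP_END(ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_end,
           COUNT(*) AS events_5m
    FROM events
    GROUP BY user_id, HOP(ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE)
""")

rolling_counts.execute().print()  # replace with a sink table (e.g. an online store) in practice
```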
Batch and micro-batch pipelines execute over bounded datasets or small time windows. Spark-style batch delivers high-throughput optimizations such as whole-stage code generation and columnar formats, making it ideal for large historical recomputations. Micro-batch streaming provides latency of 1 to 10 seconds and integrates more readily with existing Extract, Transform, Load (ETL) infrastructure.
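On the micro-batch side, a Spark Structured Streaming sketch along these lines processes newly arrived Kafka offsets on each trigger; the topic, payload schema, and the 10-second trigger interval are assumptions chosen to match the latency range above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("microbatch-features").getOrCreate()

# Ingest the raw event stream from Kafka (broker and topic are placeholder values).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "raw_events")
    .load()
)

# Parse the JSON payload into typed columns (schema is an illustrative assumption).
schema = "user_id STRING, amount DOUBLE, event_time TIMESTAMP"
events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Each 10-second trigger runs as a small batch job over the newly arrived offsets,
# giving end-to-end latency in the 1-10 second range described above.
query = (
    events
    .withWatermark("event_time", "1 minute")
    .groupBy("user_id", F.window("event_time", "5 minutes"))
    .agg(F.sum("amount").alias("amount_5m"))
    .writeStream
    .outputMode("update")
    .format("console")                                  # swap for an online-store sink in practice
    .trigger(processingTime="10 seconds")
    .option("checkpointLocation", "/tmp/checkpoints/microbatch-features")
    .start()
)
query.awaitTermination()
```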
The core design challenge is maintaining feature parity between the offline training path and the online serving path. Training requires historical features computed in batch over large datasets, while serving demands low-latency real-time features. This dual requirement drives architectural decisions around state management, time correctness, and exactly-once semantics.
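One common way to keep the two paths in sync, sketched below with assumed column names, is to define the feature logic once as a function over a DataFrame and apply it to both the bounded historical input and the unbounded stream; in Spark, withWatermark is a no-op on batch DataFrames, so the same code runs in both modes.

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.appName("feature-parity-sketch").getOrCreate()

def purchase_count_features(events: DataFrame) -> DataFrame:
    """One definition of the feature logic, shared by the training and serving paths.

    Column names (user_id, event_type, event_time) are illustrative assumptions.
    """
    return (
        events
        .filter(F.col("event_type") == "purchase")
        .withWatermark("event_time", "10 minutes")   # no-op when the input is a batch DataFrame
        .groupBy("user_id", F.window("event_time", "30 minutes"))
        .agg(F.count("*").alias("purchase_count_30m"))
    )

# Offline training path: bounded historical data read in batch (assumed path).
training_features = purchase_count_features(spark.read.parquet("s3://bucket/events_history/"))

# Online serving path: the same function applied to a streaming DataFrame with the
# same schema, e.g. the parsed Kafka stream from the micro-batch sketch above.
# serving_features = purchase_count_features(events)
```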
💡 Key Takeaways
•Streaming achieves sub-500ms latency with continuous event-time processing and stateful operators that maintain hundreds of gigabytes to multiple terabytes of keyed state
•Batch and micro-batch provide 1-to-10-second latency with optimizations for high-throughput historical recomputation across multi-petabyte datasets
•Event-time semantics with watermarks handle out-of-order data correctly for temporal features, while processing-time windows use simpler logic at the cost of correctness on late data
•Exactly-once semantics require coordinated checkpoints every 10 to 60 seconds, with recovery times of 1 to 5 minutes depending on state size (see the checkpointing sketch after this list)
•Production pipelines must maintain feature parity between the offline training path (batch-computed) and the online serving path (streaming-computed) to avoid training-serving skew
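As a rough illustration of the checkpoint cadence above, the PyFlink snippet below enables coordinated checkpoints every 30 seconds; the interval, timeout, and minimum pause are assumed values, and exactly-once is Flink's default checkpointing mode.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Coordinated checkpoints every 30 seconds (within the 10-60 second range above);
# exactly-once is the default checkpointing mode.
env.enable_checkpointing(30000)

# Give slow checkpoints room before they are declared failed, and avoid
# back-to-back checkpoints when state is large.
checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_checkpoint_timeout(120000)
checkpoint_config.set_min_pause_between_checkpoints(10000)
```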
📌 Examples
Netflix recommendations: Streaming pipeline ingests 50M events/s from Kafka, computes per-user rolling features (watch history over 30 days, genre preferences), maintains 2TB of keyed state, and writes to a lakehouse for training and to an online store for serving with 400ms P99 latency
Uber fraud detection: Flink pipeline processes trip events in real time, maintains rolling aggregates per driver and rider (trips per hour, cancellation rate), and achieves sub-second detection with exactly-once guarantees via 30-second checkpoints
Airbnb pricing features: Daily Spark batch jobs backfill historical price trends and seasonality features across 10k+ cores, processing 100TB of shuffle data for large joins between listings and booking history