
Feature Transformation Pipelines: Streaming vs Batch Architecture

Definition
Feature transformation pipelines convert raw event data into model-ready features through orchestrated compute jobs. Batch pipelines (Spark, Hive) process historical data to build training datasets. Streaming pipelines (Flink, Kafka Streams) compute features in near real time. The critical challenge is maintaining identical transformation logic across both to prevent training-serving skew.

Batch Pipeline Role

Batch pipelines process historical data in scheduled jobs (hourly, daily) to generate training datasets and backfill offline feature stores. A typical Spark job reads events from S3 or Hive, applies transformations such as aggregations, joins, and encodings, and writes the output to a feature table partitioned by date. These jobs are throughput-optimized: processing 1 TB of events into 100 million feature rows might take 30 minutes on a 100-node Spark cluster.
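As a toy illustration of the batch role, the sketch below mimics in pure Python what a date-partitioned Spark aggregation produces: group events by (user, date) and emit one feature row per group. The event schema and feature names are made up for illustration; a real job would express this as Spark SQL or DataFrame operations over data in S3 or Hive.

```python
from collections import defaultdict
from datetime import datetime, timezone

def batch_features(events):
    """Aggregate raw purchase events into per-user daily feature rows.

    A stand-in for a Spark batch job: the same logic a
    GROUP BY user_id, date in Spark SQL would express.
    """
    agg = defaultdict(lambda: {"purchase_count": 0, "total_amount": 0.0})
    for e in events:
        day = datetime.fromtimestamp(e["ts"], tz=timezone.utc).date().isoformat()
        key = (e["user_id"], day)
        agg[key]["purchase_count"] += 1
        agg[key]["total_amount"] += e["amount"]
    # Emit rows keyed by date, mirroring a date-partitioned feature table
    return [
        {"user_id": u, "date": d, **feats}
        for (u, d), feats in sorted(agg.items())
    ]

events = [
    {"user_id": "u1", "ts": 1700000000, "amount": 12.0},
    {"user_id": "u1", "ts": 1700003600, "amount": 8.0},
    {"user_id": "u2", "ts": 1700000100, "amount": 5.0},
]
rows = batch_features(events)
```

In a real pipeline the output would be written back to the offline store (e.g. a Hive table) rather than returned in memory, and the scheduler would rerun the job per partition for backfills.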

Streaming Pipeline Role

Streaming pipelines process events in near real time to update online feature stores with fresh values. Flink or Kafka Streams jobs read from Kafka topics, compute windowed aggregates with event-time semantics, and write to Redis or DynamoDB. These jobs are latency-optimized: features should reflect events within seconds to minutes. Uber runs streaming pipelines for ETA and surge features with end-to-end latency under 30 seconds.
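To make the windowing semantics concrete, here is a minimal pure-Python sketch of an event-time tumbling-window counter with a watermark, the kind of keyed stateful operator Flink provides. The class and its methods are illustrative, not a real Flink API; it shows how the watermark decides when a window closes and when a late event is dropped.

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Toy event-time tumbling-window counter with a watermark.

    Sketches a Flink-style keyed window operator: windows stay open
    until the watermark passes their end, then their aggregate is
    emitted (e.g. to an online store); later events are dropped.
    """
    def __init__(self, window_size_s, max_lateness_s):
        self.window_size = window_size_s
        self.max_lateness = max_lateness_s
        self.watermark = 0                    # event-time low-water mark
        self.open_windows = defaultdict(int)  # (key, window_start) -> count
        self.emitted = []                     # closed windows, ready to serve

    def process(self, key, event_ts):
        # Watermark = max event time seen, minus the allowed lateness
        self.watermark = max(self.watermark, event_ts - self.max_lateness)
        window_start = event_ts - (event_ts % self.window_size)
        if window_start + self.window_size <= self.watermark:
            return  # event arrived after its window closed: dropped
        self.open_windows[(key, window_start)] += 1
        self._close_expired()

    def _close_expired(self):
        # Emit every window whose end is at or before the watermark
        for (key, start) in list(self.open_windows):
            if start + self.window_size <= self.watermark:
                self.emitted.append((key, start, self.open_windows.pop((key, start))))
```

Processing-time windows skip the watermark bookkeeping entirely (they close on the wall clock), which is simpler but miscounts out-of-order events.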

Unified Logic Challenge

The same aggregation logic must produce identical results whether computed in Spark SQL or Flink. Differences in window semantics, timestamp handling, or null treatment cause training-serving skew. Feature stores like Tecton address this by compiling a single feature definition into both batch and streaming execution plans.
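One common pattern is to keep the final transformation in a single shared function that both execution paths call, so window math, null handling, and rounding cannot drift apart. A minimal sketch with hypothetical feature and function names (not a real Tecton API):

```python
def avg_order_value(total_amount, order_count):
    """Single source of truth for the feature. Both paths call this,
    including the shared null/zero treatment: no orders -> 0.0,
    never None or NaN."""
    if not order_count:
        return 0.0
    return round(total_amount / order_count, 2)

def batch_compute(rows):
    # Offline path: aggregate historical rows, then apply the shared logic
    total = sum(r["amount"] for r in rows)
    return avg_order_value(total, len(rows))

class StreamingState:
    # Online path: incremental per-event state, same final logic
    def __init__(self):
        self.total, self.count = 0.0, 0

    def on_event(self, amount):
        self.total += amount
        self.count += 1
        return avg_order_value(self.total, self.count)
```

Because `avg_order_value` is the only place the feature is defined, a training row computed offline and a serving value computed online from the same events are guaranteed to match.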

💡 Key Takeaways
- Streaming achieves sub-500ms latency with continuous event-time processing and stateful operators maintaining hundreds of gigabytes to multiple terabytes of keyed state
- Batch and micro-batch provide 1-to-10-second latency, optimized for high-throughput historical recomputation across multi-petabyte datasets
- Event-time semantics with watermarks handle out-of-order data correctly for temporal features, while processing-time windows approximate them with simpler logic
- Exactly-once semantics require coordinated checkpoints every 10 to 60 seconds, adding 1-to-5-minute recovery times depending on state size
- Production pipelines must maintain feature parity between offline training paths (batch-computed) and online serving paths (streaming-computed) to avoid training-serving skew
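The exactly-once point above can be sketched with a toy stateful operator that snapshots its state together with its input offset, so recovery never double-counts. This is a simplification of Flink's coordinated checkpoints; the class and its methods are illustrative only.

```python
class CheckpointedCounter:
    """Toy stateful operator with checkpoint/restore.

    The key idea behind exactly-once state: the aggregate and the
    input offset are snapshotted and rolled back together, so after
    a failure each event is reflected in the state exactly once.
    """
    def __init__(self):
        self.count = 0
        self.offset = 0            # next input position to read
        self._checkpoint = (0, 0)

    def process(self, events):
        # Resume from the committed offset, never from the beginning
        for i in range(self.offset, len(events)):
            self.count += events[i]
            self.offset = i + 1

    def snapshot(self):
        self._checkpoint = (self.count, self.offset)

    def restore(self):
        # Roll back state AND offset atomically: no double counting
        self.count, self.offset = self._checkpoint
```

Recovery time in a real system is dominated by reloading this state from durable storage, which is why large keyed state stretches restarts into minutes.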
📌 Interview Tips
1. Netflix recommendations: the streaming pipeline ingests 50M events/s from Kafka, computes per-user rolling features (30-day watch history, genre preferences), maintains 2 TB of keyed state, and writes to a lakehouse for training and an online store for serving with 400 ms P99 latency
2. Uber fraud detection: a Flink pipeline processes trip events in real time, maintains rolling aggregates per driver and rider (trips per hour, cancellation rate), and achieves sub-second detection with exactly-once guarantees via 30-second checkpoints
3. Airbnb pricing features: daily Spark batch jobs backfill historical price-trend and seasonality features across 10k+ cores, processing 100 TB shuffles for large joins between listings and booking history