Spark for Batch
Apache Spark dominates batch feature engineering with DataFrame and SQL APIs that scale to petabyte datasets. Key operations include window functions for rolling aggregates (sum of purchases in the last 30 days), joins across multiple tables (user profile joined to transaction history), and user-defined functions for custom transformations. Spark excels at throughput: a 1-billion-row aggregation completes in 10 to 30 minutes on a 50-node cluster. Incremental processing with Delta Lake or Hudi enables processing only the new data since the last run, cutting compute by 10 to 100x.
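The rolling-aggregate logic that Spark expresses as a window function (partition by user, order by time, range over the trailing 30 days) can be sketched in plain Python; the event schema here is made up for illustration:

```python
from collections import defaultdict, deque

WINDOW = 30 * 24 * 3600  # 30 days in seconds

def rolling_30d_sums(events):
    """events: iterable of (user_id, epoch_seconds, amount), sorted by time.
    Yields (user_id, ts, sum of that user's purchases in the last 30 days)."""
    state = defaultdict(deque)   # user_id -> deque of (ts, amount) inside the window
    sums = defaultdict(float)
    for user, ts, amount in events:
        q = state[user]
        # Evict events that have fallen out of the 30-day window.
        while q and q[0][0] <= ts - WINDOW:
            _, old_amt = q.popleft()
            sums[user] -= old_amt
        q.append((ts, amount))
        sums[user] += amount
        yield user, ts, sums[user]
```

Spark computes the same per-partition sliding sum in parallel across executors; the deque here stands in for the rows a `RANGE BETWEEN` window frame retains per partition.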
Flink for Streaming
Apache Flink provides true event-time processing with watermarks for handling late data, exactly-once semantics through checkpointing, and low-latency windowed aggregations. For a "purchases in the last 5 minutes" feature, Flink maintains per-entity state, updates it on each event, and emits results as windows close. Latency from event to feature availability is typically 1 to 30 seconds. Uber and Alibaba run Flink at millions of events per second for real-time ML features.
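A plain-Python sketch of what Flink does for such a feature, keyed per-entity counts in tumbling 5-minute windows, emitted once the watermark passes a window's end (the 30-second watermark lag is an assumed value, and the schema is illustrative):

```python
from collections import defaultdict

WINDOW = 300  # tumbling 5-minute windows, in seconds
LAG = 30      # assumed watermark lag behind the max observed event time

def windowed_counts(events):
    """events: iterable of (entity_id, event_time_seconds).
    Yields (entity_id, window_start, count) as windows close."""
    state = defaultdict(int)          # (entity, window_start) -> count
    watermark = float("-inf")
    for entity, ts in events:
        win = ts - ts % WINDOW        # assign the event to its tumbling window
        state[(entity, win)] += 1
        watermark = max(watermark, ts - LAG)
        # Emit and evict every window whose end the watermark has passed.
        for key in sorted(k for k in state if k[1] + WINDOW <= watermark):
            e, w = key
            yield e, w, state.pop(key)
    # End of stream: flush still-open windows.
    for (e, w), c in sorted(state.items()):
        yield e, w, c
```

In real Flink the state is partitioned by key across task slots and the watermark advances per partition; this single-process sketch only shows the window-assignment, trigger, and eviction logic.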
Kafka Streams Alternative
For simpler streaming needs, Kafka Streams offers a lightweight library (no separate cluster) that processes data directly from Kafka. It suits stateless transformations, simple aggregations, and teams already operating Kafka, trading Flink's power features, such as complex event processing and sophisticated windowing, for operational simplicity.
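The shape of topology Kafka Streams is suited for, a stateless filter and map feeding a simple keyed count, reduces to logic like this (plain-Python sketch with an invented record schema, not Kafka Streams API code):

```python
from collections import Counter

def filter_map_count(records):
    """Stateless filter/map followed by a keyed count -- the shape of a
    typical Kafka Streams topology (filter -> map -> groupByKey -> count)."""
    counts = Counter()
    for rec in records:
        if rec.get("amount", 0) <= 0:   # stateless filter: drop non-purchases
            continue
        key = rec["user_id"].lower()    # stateless map: normalize the key
        counts[key] += 1                # simple aggregation: count per key
    return counts
```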
Orchestration
Airflow or Dagster schedules batch pipelines with dependency management and retry logic. Feature-freshness SLAs drive scheduling: a 1-hour freshness requirement means hourly job runs with a 15-minute buffer for compute time and retries.
✓ Keyed state partitions by entity ID, co-locating all events for a user or item to enable rolling features without external lookups, scaling to multi-terabyte state across clusters
✓ Watermarks signal event-time progress and trigger window emission. Calibrate watermark lag to observed P99 inter-arrival times: too tight drops late data, too loose delays outputs and inflates memory
✓ Checkpoints snapshot state every 10 to 60 seconds for exactly-once recovery. With 500GB of state, full checkpoints take 3 to 5 minutes; incremental checkpoints reduce this to under 1 minute
✓ State TTL aligned to the feature horizon prevents unbounded growth. For 7-day rolling features, set TTL to 7 days plus watermark lag to evict old keys automatically
✓ Approximate aggregations like HyperLogLog for cardinality (1.5% error with 12KB per key) or Count-Min Sketch for frequency keep state sub-linear for billion-key workloads
✓ Data skew on hot keys (super-active users) causes partition imbalance. Mitigate with key salting or local pre-aggregation before global aggregation to distribute load
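The salting mitigation from the last point, spreading a hot key across N salted sub-keys, pre-aggregating per salted key, then merging, looks like this for a count (plain-Python sketch; the fan-out of 8 is an assumed tuning value):

```python
from collections import Counter
import random

N_SALTS = 8  # fan-out per key; tune to the observed skew

def salted_counts(events):
    """events: iterable of entity keys. Stage 1 aggregates on (key, salt),
    spreading a hot key over N_SALTS partitions; stage 2 strips the salt
    and merges, so the global aggregation sees at most N_SALTS rows per key."""
    stage1 = Counter()
    for key in events:
        salt = random.randrange(N_SALTS)        # random salt; a hash of another field also works
        stage1[(key, salt)] += 1                # local pre-aggregation per salted key
    stage2 = Counter()
    for (key, _salt), count in stage1.items():  # global aggregation after de-salting
        stage2[key] += count
    return stage2
```

The final totals are identical to an unsalted count; only the intermediate partitioning changes, which is what relieves the hot partition.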
1. LinkedIn feed ranking: a Flink pipeline maintains per-member rolling engagement features (likes, comments, shares over 30 minutes) with 10-minute session windows. State size: 800GB across 200 task slots. Checkpoint interval: 45 seconds. Recovery time: 2 minutes with incremental checkpoints.
2. Uber real-time ETL: stateful deduplication within 5-minute windows per trip ID to handle retry storms. State pruned with TTL equal to window size plus a 2-minute watermark lag. Prevents duplicate fare calculations.
3. Netflix viewing features: per-user rolling count of genres watched in the last 24 hours using sliding 1-hour windows. State compaction with approximate counting (the lossy counting algorithm) keeps per-user state under 5KB while tracking thousands of genre interactions.
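The deduplication pattern in the second case study, keyed state evicted with a TTL of window size plus watermark lag, reduces to a few lines (plain-Python sketch; the linear eviction scan stands in for Flink's built-in state TTL):

```python
def dedupe(events, ttl_seconds=420):
    """events: iterable of (trip_id, event_time_seconds), roughly time-ordered.
    Yields only the first occurrence of each trip_id seen within the TTL
    (default 420s = 5-minute window + 2-minute watermark lag)."""
    seen = {}  # trip_id -> event time of the first emitted occurrence
    for trip, ts in events:
        # Evict expired keys so state stays bounded; Flink does this via state TTL.
        for k in [k for k, t in seen.items() if ts - t > ttl_seconds]:
            del seen[k]
        if trip not in seen:           # first sighting within the TTL: pass through
            seen[trip] = ts
            yield trip, ts             # duplicates inside the TTL are dropped
```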