Exactly Once Semantics
Exactly-once semantics guarantee that each event's effect appears exactly once in the output, even across failures and retries. Flink achieves this through coordinated checkpointing: it periodically snapshots operator state to durable storage, and on failure restarts from the last checkpoint and replays events. Combined with idempotent sinks (upserts to a key-value store keyed by entity plus window), reprocessed events produce identical outputs. Without exactly-once, aggregations like "total purchases" can over- or undercount by the number of reprocessed events.
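The checkpoint-and-replay mechanism above can be sketched in a few lines. This is a toy model, not Flink's API: `Pipeline`, `snapshot`, and `restore` are illustrative names, and the sink is a plain dict standing in for a key-value store.

```python
# Toy model of checkpoint-and-replay with an idempotent sink.
# All class and method names are illustrative, not from any real API.
import copy

class Pipeline:
    def __init__(self):
        self.state = {}          # operator state: entity -> running count
        self.sink = {}           # idempotent sink: (entity, window) -> value
        self.checkpoint = None   # last durable snapshot: (state, input offset)

    def snapshot(self, offset):
        # Persist operator state together with the input position.
        self.checkpoint = (copy.deepcopy(self.state), offset)

    def restore(self):
        # On failure: reload state, resume reading from the saved offset.
        state, offset = self.checkpoint
        self.state = copy.deepcopy(state)
        return offset

    def process(self, event):
        entity, window = event["entity"], event["window"]
        self.state[entity] = self.state.get(entity, 0) + 1
        # Upsert keyed by (entity, window): a replayed event overwrites the
        # sink with the same value instead of double counting.
        self.sink[(entity, window)] = self.state[entity]
```

Because state and offset are restored together and the sink is an upsert, replaying the events after the last checkpoint leaves the sink exactly as it was before the failure.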
Idempotent Writes
The sink must handle duplicate writes without corrupting state. For a "purchase count in last hour" feature, writing to Redis with INCR is not idempotent: replayed events increment the counter again. Instead, use HSET on a hash keyed by entity plus window, with the event ID as the field, and count distinct fields. Alternatively, maintain an append-only event log and compute aggregates on read. DynamoDB conditional updates with version numbers provide idempotent upsert semantics.
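A minimal sketch of the HSET-style pattern, using an in-memory dict of sets in place of Redis (one hash per entity-window, one field per event ID; the count is the number of distinct fields, as HLEN would report):

```python
# In-memory stand-in for the Redis hash pattern: HSET key event_id 1,
# count via HLEN. Adding an existing field is a no-op, so replays are safe.
class IdempotentCounter:
    def __init__(self):
        self.hashes = {}  # key "entity:window" -> set of event IDs (fields)

    def record(self, entity, window, event_id):
        # Like HSET: inserting a field that already exists changes nothing,
        # so a replayed event does not inflate the count.
        self.hashes.setdefault(f"{entity}:{window}", set()).add(event_id)

    def count(self, entity, window):
        # Like HLEN: the number of distinct fields in the hash.
        return len(self.hashes.get(f"{entity}:{window}", set()))
```

Recording the same event ID twice leaves the count unchanged, which is exactly the property INCR lacks.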
Deduplication
Upstream event sources may send duplicates due to at-least-once delivery, producer retries, or application bugs. Deduplication stores event IDs in a short-term cache (for example, Redis with a 1-hour TTL) and filters out events already seen. For high-volume streams, Bloom filters provide probabilistic deduplication with a controlled false-positive rate. Without deduplication, a 1 percent duplicate rate in click events inflates click counts by 1 percent.
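The TTL-cache approach can be sketched as follows. This is an in-process stand-in for the Redis pattern (conceptually `SET event_id 1 NX EX 3600`); the class name and the injectable `now` parameter are illustrative conveniences:

```python
import time

class TtlDedup:
    """Event-ID cache with a TTL: an ID is a duplicate only if it was seen
    within the last ttl_seconds. Stand-in for Redis SET ... NX EX."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> expiry timestamp

    def is_new(self, event_id, now=None):
        now = time.time() if now is None else now
        expiry = self.seen.get(event_id)
        if expiry is not None and expiry > now:
            return False                      # duplicate within TTL window
        self.seen[event_id] = now + self.ttl  # record (or refresh) the ID
        return True
```

A Bloom filter would replace the dict when the stream is too large to hold every recent ID, trading exactness for a bounded false-positive rate (some genuinely new events wrongly dropped as duplicates).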
Checkpoint Tuning
Frequent checkpoints (every 10 seconds) minimize reprocessing on failure but add overhead; infrequent checkpoints (every 5 minutes) reduce overhead but require replaying more events on recovery. Balance the interval against event volume and acceptable recovery time. At 10,000 events per second, a 60-second checkpoint interval means replaying up to 600,000 events after a failure.
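The arithmetic behind that tradeoff fits in two helper functions. Both names are illustrative, and the replay throughput in the comment is an assumed figure, not from the text:

```python
def replay_volume(events_per_sec, checkpoint_interval_sec):
    # Worst case: failure lands just before the next checkpoint, so every
    # event since the last checkpoint must be replayed.
    return events_per_sec * checkpoint_interval_sec

def recovery_time_sec(replay_events, replay_rate_eps):
    # Rough recovery estimate: time to chew through the backlog at the
    # pipeline's replay throughput (state-restore time ignored here).
    return replay_events / replay_rate_eps

# 10,000 events/s with a 60 s interval -> 600,000 events to replay.
# If the pipeline can replay at, say, 50,000 events/s, that is 12 s of
# catch-up on top of restart and state-restore time.
```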
✓ Streaming achieves sub-500ms latency with always-on clusters costing $15k per month for 100M events per day, versus batch at $3k per month with minute-level latency, scaling to zero between runs
✓ Use streaming for fraud detection, real-time ranking, and ads bidding, where sub-second freshness is required. Use batch for training backfills, daily aggregations, and weekly model retraining, where hours of staleness are acceptable
✓ Micro-batch provides 1 to 10 second latency with simpler operations than stateful streaming, bridging the gap for near-real-time use cases like content recommendations with 5-second freshness
✓ Streaming operational overhead includes checkpoint management (10 to 60 second intervals, 1 to 5 minute recovery), backpressure monitoring, and state-growth management. Batch has simpler failure recovery: rerun failed partitions
✓ Event-time correctness with watermarks lets streaming handle out-of-order data precisely for temporal features. Batch approximates with processing time and partition boundaries, which is acceptable for coarse-grained features
✓ Cost scales differently: streaming pays for steady-state capacity regardless of volume, while batch pays per job execution. For bursty workloads (10x daily peaks), batch can be 5x cheaper than provisioning streaming for peak capacity
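A back-of-envelope model makes the bursty-workload claim concrete. The pricing functions and unit prices here are hypothetical, chosen only to show the shape of the comparison:

```python
# Illustrative cost model (hypothetical unit prices, not real cloud pricing).
def streaming_monthly_cost(peak_eps, hours, price_per_eps_hour):
    # Streaming provisions capacity for the peak rate and pays for it
    # around the clock, whether or not traffic is near the peak.
    return peak_eps * hours * price_per_eps_hour

def batch_monthly_cost(runs, price_per_run):
    # Batch pays per job execution and scales to zero between runs.
    return runs * price_per_run

# With a 10x daily peak over the average rate, the streaming cluster sits
# mostly idle; whether batch lands ~5x cheaper depends on the unit prices.
```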
1. Uber real-time pricing: Streaming pipeline computes supply and demand features (drivers available in area, ride requests in last 10 minutes) with 300ms P99 latency on an always-on Flink cluster. Switching to batch would miss surge pricing windows.
2. Airbnb search ranking training: Daily Spark batch jobs generate 6 months of historical features for model retraining, processing 50TB of listing views and bookings across 8k cores in 4 hours. Cost: $800 per day vs $25k per month for an equivalent streaming cluster.
3. LinkedIn feed ranking: Micro-batch with 10-second intervals computes engagement features (likes and comments on recent posts). Simpler than stateful streaming for this latency requirement; reduces ops cost by 40% versus continuous streaming.
4. Netflix recommendation backfills: Monthly full historical recompute of all user features over 2 years of viewing data. Spark batch across 12k cores, 200TB shuffle. Cost: $15k one-time job. A streaming equivalent would cost $180k per month for unused capacity.