Choosing Streaming vs Batch: Latency, Cost, and Operational Trade-offs
The choice between streaming and batch for feature pipelines depends on freshness service-level agreements (SLAs), cost constraints, and operational complexity. Streaming with Flink-style continuous processing achieves sub-500ms end-to-end latency but requires always-on clusters, with 24/7 resource costs and operational overhead for checkpointing, backpressure management, and state growth. Batch with Spark delivers 1 to 10 second latency in micro-batch mode, or minutes for full batch runs, but can scale to zero between runs, making it 3x to 5x cheaper for workloads without strict freshness requirements.
For use cases like fraud detection, ads bidding, or real-time ranking, sub-second latency is non-negotiable. A fraud model needs current transaction features (amount, merchant, activity in the last 5 minutes) to block fraudulent charges before authorization completes; streaming is the only viable option. Conversely, features for weekly recommendation-model retraining or daily dashboard aggregations tolerate hours of staleness. Nightly batch jobs process multi-petabyte historical data with massive parallelism (10k+ cores) at lower total cost than keeping an equivalent streaming cluster running continuously.
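To make this concrete, here is a minimal PySpark Structured Streaming sketch of 5-minute fraud features. The Kafka topic (`transactions`), broker address, schema, and column names are illustrative assumptions, not from the source, and the console sink stands in for a real online feature store.

```python
# Minimal sketch: streaming 5-minute transaction features per card.
# Assumes the spark-sql-kafka connector is on the classpath and that a
# topic named "transactions" carries JSON payloads (hypothetical names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud-features").getOrCreate()

schema = (StructType()
          .add("card_id", StringType())
          .add("merchant", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
          .option("subscribe", "transactions")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Sliding 5-minute window of spend and transaction count per card;
# the watermark bounds how late an event may arrive before it is dropped.
features = (events
            .withWatermark("event_time", "1 minute")
            .groupBy(F.window("event_time", "5 minutes", "1 minute"), "card_id")
            .agg(F.count("*").alias("txn_count_5m"),
                 F.sum("amount").alias("spend_5m")))

query = (features.writeStream
         .outputMode("update")
         .format("console")  # stand-in for a real online-store sink
         .start())
```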
Micro-batch bridges the gap for near-real-time use cases that need 1 to 10 second freshness, such as content recommendations or notification triggers. It is operationally simpler than continuous streaming (lighter checkpointing and state management, easier-to-reason-about batch boundaries) while staying compatible with existing batch ETL tooling and lakehouse formats. The latency floor is the micro-batch interval itself: a 5-second micro-batch cannot serve sub-second requests.
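A micro-batch version of the same pattern, sketched below with Spark's built-in `rate` source so it runs standalone; the 5-second `processingTime` trigger is the latency floor described above.

```python
# Minimal sketch: micro-batch aggregation triggered every 5 seconds.
# The built-in "rate" source stands in for a real event stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

counts = (events
          .withWatermark("timestamp", "10 seconds")
          .groupBy(F.window("timestamp", "10 seconds"))
          .agg(F.count("*").alias("events")))

query = (counts.writeStream
         .outputMode("update")
         .trigger(processingTime="5 seconds")  # micro-batch interval = latency floor
         .format("console")
         .start())
```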
Cost postures differ fundamentally. Streaming clusters incur steady compute (CPU and memory for operators), network (shuffles and remote state access), and storage (checkpoints and state backends) costs regardless of event volume. Batch jobs pay primarily for compute during execution plus storage for persisted outputs, with costs dropping to zero between runs. For a workload of 100 million events per day, streaming might cost $15k per month in always-on cluster capacity, while batch costs $3k per month in ephemeral compute and storage.
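The dollar figures above follow from a simple capacity model. The sketch below reproduces the arithmetic with illustrative node counts and an assumed hourly rate; none of these numbers are vendor pricing.

```python
# Back-of-envelope model behind the $15k vs $3k comparison.
# Node counts and hourly rate are illustrative assumptions.
STREAM_NODES = 20          # always-on cluster sized for peak
NODE_COST_PER_HOUR = 1.04  # assumed blended rate per node-hour
HOURS_PER_MONTH = 730

streaming_monthly = STREAM_NODES * NODE_COST_PER_HOUR * HOURS_PER_MONTH
# ~ $15,200/month, paid whether or not events arrive

BATCH_NODES = 50           # larger ephemeral fleet, short-lived
BATCH_HOURS_PER_RUN = 2
RUNS_PER_MONTH = 30

batch_monthly = BATCH_NODES * NODE_COST_PER_HOUR * BATCH_HOURS_PER_RUN * RUNS_PER_MONTH
# ~ $3,100/month; cost scales with job time, zero between runs

print(f"streaming ~ ${streaming_monthly:,.0f}/mo, batch ~ ${batch_monthly:,.0f}/mo")
```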
💡 Key Takeaways
•Streaming achieves sub-500ms latency with always-on clusters costing roughly $15k per month for 100M events per day, versus batch at $3k per month with minute-level latency and the ability to scale to zero between runs
•Use streaming for fraud detection, real-time ranking, and ads bidding, where sub-second freshness is required. Use batch for training backfills, daily aggregations, and weekly model retraining, where hours of staleness are acceptable
•Micro-batch provides 1 to 10 second latency with simpler operations than stateful streaming, bridging the gap for near-real-time use cases like content recommendations with 5-second freshness
•Streaming operational overhead includes checkpoint management (10 to 60 second intervals, 1 to 5 minute recovery), backpressure monitoring, and state-growth management. Batch has simpler failure recovery: rerun the failed partitions
•Event-time processing with watermarks lets streaming handle out-of-order data precisely for temporal features. Batch approximates with processing time and partition boundaries, which is acceptable for coarse-grained features (see the sketch after this list)
•Cost scales differently: streaming pays for steady-state capacity regardless of volume, while batch pays per job execution. For bursty workloads (10x daily peaks), batch can be 5x cheaper than provisioning a streaming cluster for peak capacity
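To make the event-time point concrete, the streaming sketch earlier used `withWatermark` to handle late, out-of-order events precisely. The batch approximation looks like the sketch below: features bucketed by the partition an event landed in, not by event time. The table path and column names (`ingest_date`, `card_id`) are hypothetical.

```python
# Minimal sketch of the batch approximation: daily features keyed by the
# ingest-date partition, ignoring event-time ordering within the day.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-features").getOrCreate()

txns = spark.read.parquet("s3://warehouse/transactions/")  # hypothetical path

daily = (txns
         .groupBy("card_id", F.col("ingest_date"))  # partition boundary, not event time
         .agg(F.count("*").alias("txn_count_1d"),
              F.sum("amount").alias("spend_1d")))

# A late event that landed in the next day's partition is counted there,
# which is fine for coarse daily features but not for 5-minute windows.
daily.write.mode("overwrite").parquet("s3://warehouse/features/daily/")
```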
📌 Examples
Uber real-time pricing: A streaming pipeline computes supply and demand features (drivers available in an area, ride requests in the last 10 minutes) with 300ms P99 latency on an always-on Flink cluster. Switching to batch would miss surge-pricing windows.
Airbnb search ranking training: Daily Spark batch jobs generate 6 months of historical features for model retraining, processing 50TB of listing views and bookings across 8k cores in 4 hours. Cost: $800 per day versus $25k per month for an equivalent streaming cluster.
LinkedIn feed ranking: Micro-batch with 10-second intervals computes engagement features (likes and comments on recent posts). Simpler than stateful streaming at this latency requirement, it reduces ops cost by 40% versus continuous streaming.
Netflix recommendation backfills: A monthly full recompute of all user features over 2 years of viewing data runs as a Spark batch job across 12k cores with a 200TB shuffle. Cost: $15k per run. A streaming equivalent would cost $180k per month in idle capacity.