Batch vs Stream Processing Trade-offs
Production Reality: Scaling Stream and Batch Together
At companies processing 500,000 to 5 million events per second, both batch and streaming pipelines coexist and share the same raw event stream. Understanding how they partition responsibility is crucial for system design interviews.
The Dual Pipeline Architecture: Events from mobile clients, web frontends, and backend services flow into a durable, append-only message bus replicated across availability zones. This bus retains data for 3 to 7 days, supporting both real-time consumption and historical replay.
The streaming path reads from near the head of this log. Fraud detection systems might target p50 latency under 100 milliseconds and p99 under 300 milliseconds from event creation to blocking decision. Real-time monitoring dashboards typically tolerate 5 to 60 seconds of lag. These stream processors feed low-latency feature stores for machine learning models, alerting systems that page engineers, and real-time personalization like dynamic pricing.
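Latency targets like these are checked against percentile distributions of end-to-end lag, not averages. A minimal sketch of that check, using synthetic lognormal samples (the distribution parameters are invented for illustration; real numbers come from pipeline metrics):

```python
import random
import statistics

# Synthetic end-to-end lag samples in milliseconds. The lognormal
# parameters are assumptions for illustration only.
random.seed(1)
lags_ms = [random.lognormvariate(4.0, 0.5) for _ in range(10_000)]

# p50 is the median; statistics.quantiles with n=100 returns 99 cut
# points, and index 98 is the 99th percentile.
p50 = statistics.median(lags_ms)
p99 = statistics.quantiles(lags_ms, n=100)[98]

# The fraud-detection SLO quoted above: p50 < 100 ms, p99 < 300 ms.
meets_fraud_slo = p50 < 100 and p99 < 300
```

Tail percentiles matter because a fraud check that usually takes 50 ms but occasionally takes 2 seconds still delays real transactions.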
In parallel, the same events are landed into long-term object storage, partitioned by date and hour. Batch jobs run on schedule: hourly aggregates for operational metrics, daily rollups for finance, and heavy feature engineering over months of history for model training. Latencies here are 30 minutes to 24 hours, but throughput can reach tens of terabytes per hour.
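The date/hour partitioning typically shows up as a path convention in object storage, so batch jobs can prune to exactly the hours they need. A sketch of that layout (the bucket name is a hypothetical placeholder):

```python
from datetime import datetime, timezone

def landing_path(event_ts: float, bucket: str = "s3://events-raw") -> str:
    # Map an event timestamp to the date/hour partition it lands in.
    # "s3://events-raw" is an illustrative bucket name, not from the text.
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return f"{bucket}/dt={dt:%Y-%m-%d}/hr={dt:%H}/"

print(landing_path(0))  # -> s3://events-raw/dt=1970-01-01/hr=00/
```

A daily rollup then reads 24 hour-partitions for one `dt=` prefix instead of scanning the whole bucket.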
Handling Scale Transitions: To scale from 100k to 1M events per second, you increase the partition count of the message bus. If each partition sustains 10 MB per second and events average roughly 1 KB, then 1M events per second requires 100 partitions. Stream processors scale horizontally: with good key distribution, doubling the partitions roughly doubles consumer capacity.
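The partition arithmetic above can be written down directly. The ~1 KB average event size is the figure implied by "100 partitions at 10 MB/s each for 1M events/sec"; treat it as an assumption, not a measurement:

```python
import math

def partitions_needed(events_per_sec: int, avg_event_bytes: int,
                      partition_mb_per_sec: float) -> int:
    # Total ingress throughput in MB/s, divided by per-partition capacity,
    # rounded up to a whole number of partitions.
    throughput_mb = events_per_sec * avg_event_bytes / 1_000_000
    return math.ceil(throughput_mb / partition_mb_per_sec)

print(partitions_needed(1_000_000, 1_000, 10))  # -> 100
print(partitions_needed(100_000, 1_000, 10))    # -> 10
```

In practice you would add headroom above this floor so a single consumer failure or traffic spike does not saturate the remaining partitions.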
For batch, the challenge is skew. If one user or region generates 10x more events than others, a few partitions become bottlenecks. A job that should complete in 1 hour might take 6 hours because a handful of skewed tasks dominate. Solutions include salting keys, breaking hot partitions into smaller chunks, or using speculative execution to retry slow tasks.
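Key salting, the first of those remedies, works by fanning a hot key out across several sub-partitions and merging the partial results afterward. A minimal sketch, assuming an 8-way fan-out (`NUM_SALTS` is an illustrative value, not from the text):

```python
import hashlib

NUM_SALTS = 8  # assumed fan-out; tune to the observed skew

def salted_key(key: str, record_id: str) -> str:
    # Derive the salt from the record id so assignment is deterministic:
    # the same record always maps to the same sub-partition, but a hot
    # key's records spread across NUM_SALTS partitions.
    salt = int(hashlib.md5(record_id.encode()).hexdigest(), 16) % NUM_SALTS
    return f"{key}#{salt}"

# Downstream, aggregation happens in two stages: partial aggregates per
# salted key, then a final merge of the NUM_SALTS partials per original key.
print(salted_key("hot-user-42", "txn-0001"))
```

The cost is that second merge stage, which is why salting is applied selectively to known-hot keys rather than everywhere.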
✓ In Practice: Batch pipelines are the source of truth. Stream outputs are treated as best effort and eventually corrected by batch recomputation. This separation lets you iterate quickly on streaming logic without risking data integrity.
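The correction step usually reduces to a merge where batch values win wherever the recomputation has caught up. A minimal sketch of that precedence rule (keys and figures are illustrative):

```python
def reconcile(stream_view: dict, batch_view: dict) -> dict:
    # Batch is the source of truth: its recomputed values overwrite the
    # best-effort streaming estimates. Keys only the stream has seen so
    # far (e.g. today, before the nightly job runs) keep their estimates.
    merged = dict(stream_view)
    merged.update(batch_view)
    return merged

daily_revenue = reconcile(
    {"2024-01-01": 98_700, "2024-01-02": 51_300},  # streaming estimates
    {"2024-01-01": 99_102},                        # batch recomputation
)
print(daily_revenue)  # -> {'2024-01-01': 99102, '2024-01-02': 51300}
```

Serving reads from the merged view means consumers never have to know which pipeline produced a given number.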
Concrete Example at Scale: Consider a payment platform handling 2 million transactions per second during peak. The streaming path detects fraud patterns in real time, blocking suspicious transactions within 200 milliseconds. Meanwhile, batch jobs process the full day of transactions overnight, running complex graph algorithms to identify fraud rings and feeding updated models back to the streaming system by morning.
The batch job might scan 500 TB of data using 10,000 cores for 2 hours, costing perhaps $300 in compute. The streaming system runs 24/7 on 200 dedicated cores, costing $5,000 per month. The trade-off is explicit: batch optimizes cost per byte processed, streaming optimizes latency.
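Redoing that arithmetic makes the comparison concrete. The per-core-hour rates below are derived from the figures in the text, not quoted in it:

```python
# Batch: 10,000 cores for 2 hours at ~$300 per run (elastic/spot pricing).
batch_core_hours = 10_000 * 2
batch_rate = 300 / batch_core_hours      # 0.015 USD per core-hour
batch_cost_per_tb = 300 / 500            # 0.60 USD per TB scanned

# Streaming: 200 dedicated cores around the clock at $5,000 per month.
stream_core_hours = 200 * 24 * 30        # ~144,000 core-hours per month
stream_rate = 5_000 / stream_core_hours  # ~0.035 USD per core-hour
```

The dedicated streaming cores cost more per core-hour than the elastic batch fleet, which is the cost side of the latency-versus-throughput trade-off the text describes.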
Cost Comparison at 2M Events/Sec: batch runs daily at roughly $300 per run; streaming runs 24/7 at roughly $5,000 per month.
💡 Key Takeaways
✓ At scale, streaming and batch share the same durable message bus retaining 3 to 7 days of events for both real-time and backfill use cases
✓ Streaming targets sub-second latencies (p50 under 100 ms for fraud) and runs 24/7; batch processes tens of terabytes with 30-minute to 24-hour latencies
✓ Batch serves as the source of truth with high fidelity; streaming provides best-effort real-time views that are eventually corrected by batch recomputation
✓ To scale 10x, increase message bus partitions proportionally (1M events/sec needs ~100 partitions at 10 MB/sec each) and scale consumers horizontally
✓ Skew is the primary batch scaling bottleneck: one hot partition can stretch a job from 1 hour to 6 hours, requiring salted keys or split partitions
📌 Examples
1. Payment platform at 2M transactions per second uses streaming for 200 ms fraud detection and batch overnight jobs scanning 500 TB on 10k cores for complex fraud-ring analysis
2. Streaming system costs $5,000 per month for 24/7 operation; a batch job costs $300 per run using elastic compute for 2 hours, optimizing cost per byte
3. Netflix uses streaming with 5 to 60 seconds of lag for real-time recommendations while batch pipelines process months of viewing history for model training