Kappa Architecture and Unified Engines
Simplifying with Kappa:
Kappa architecture emerged as a reaction to Lambda's complexity. The insight: if your streaming system is powerful enough, why maintain two separate code paths? Just treat everything as a stream, and implement batch processing by replaying the event log from the beginning.
This requires two architectural foundations. First, you need a durable, replayable event log with long retention, typically weeks or months. Second, your streaming applications must scale to handle full historical throughput, not just real-time arrival rates.
How Replay Works:
To recompute results for a specific time range, you reset your application state to a checkpoint before that range, then replay events at maximum speed. For example, to fix a bug that affected last week's metrics, you roll back to 7 days ago and replay 7 days of events.
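The replay mechanism can be sketched in a few lines. This is a toy model, not a real streaming engine: the event log is an in-memory list, `process` stands in for your application logic, and the checkpoint is a plain dict snapshot of state.

```python
import time
from dataclasses import dataclass

@dataclass
class Event:
    ts: float      # event timestamp, seconds since epoch
    key: str
    value: int

def process(state: dict, event: Event) -> None:
    # The single code path: identical logic for live and replayed events.
    state[event.key] = state.get(event.key, 0) + event.value

def replay(log: list, checkpoint_state: dict, from_ts: float) -> dict:
    """Reset to a checkpoint taken before from_ts, then replay forward."""
    state = dict(checkpoint_state)       # roll back to the checkpoint
    for event in log:
        if event.ts >= from_ts:          # only the affected range
            process(state, event)        # at full speed, no wall-clock pacing
    return state

# Fixing last week's metrics: roll back 7 days and replay 7 days of events.
now = time.time()
week_ago = now - 7 * 24 * 3600
log = [Event(week_ago + i * 3600, "clicks", 1) for i in range(24 * 7)]
state = replay(log, checkpoint_state={}, from_ts=week_ago)
```

Note that replay runs at maximum ingest speed rather than real-time pacing, which is exactly why the infrastructure must handle far more than the normal arrival rate.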
The key difference from Lambda: you have one code path. The same application logic processes both real-time and historical data. This eliminates schema drift and logic divergence between batch and stream layers.
This approach reduces operational complexity compared to Lambda. You have one storage layer, one set of schemas, one query interface. But it demands more sophisticated infrastructure: your streaming engine must handle both incremental updates and full scans efficiently, and your storage must support high-throughput writes while serving low-latency reads.
Trade-offs Versus Lambda:
Kappa and unified approaches reduce code duplication and eliminate drift between batch and stream logic. However, they require streaming infrastructure provisioned for peak historical throughput, not just real-time load. For systems processing tens of terabytes daily, this can mean significantly higher baseline costs.
Lambda's separated paths let you optimize each independently: batch jobs run on cheaper spot instances during off-peak hours, while streaming runs on reserved capacity for predictable latency. Unified systems pay for streaming infrastructure capable of handling batch-scale workloads.
✓ In Practice: Kappa works best when your historical reprocessing volume is manageable. If you need to replay billions of events regularly, you need infrastructure that can burst to 10x or 100x your normal throughput without overwhelming downstream systems.
Unified Engines and Lakehouse Storage:
Modern unified engines take this further by treating batch and stream as variants of the same computation model. You write jobs once using a unified Application Programming Interface (API) that works for both bounded (batch) and unbounded (stream) data.
The engine handles the complexity underneath: incremental processing, watermarking, state management, and exactly-once semantics. The storage layer provides transactional guarantees, time travel queries, and change data capture.
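The "write once, run on bounded or unbounded data" idea can be illustrated without any particular engine. In this hypothetical sketch, `running_counts` is one job definition; whether it behaves like a batch scan or a streaming update depends only on the source plugged into it.

```python
from typing import Iterable, Iterator, Tuple

def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """One job definition: emits an updated count after every event.

    On a bounded source this behaves like a batch scan; on an
    unbounded source it behaves like a streaming update. The logic
    is identical either way.
    """
    counts: dict = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]

# Bounded (batch) input: a finite list.
batch_result = dict(running_counts(["a", "b", "a"]))

# Unbounded (stream) input: any iterator, e.g. a message-queue consumer,
# plugs into the same function; a generator stands in for the stream here.
def stream():
    yield from ["a", "a", "b"]

stream_result = dict(running_counts(stream()))
```

Both results converge to the same counts, which is the property unified engines formalize: a bounded dataset is just a stream that happens to end.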
The Pattern in Action:
Streaming ingestion lands events into a transactional table with ACID (Atomicity, Consistency, Isolation, Durability) guarantees. Micro-batch or continuous jobs update the same table with both historical corrections and real-time inserts. Downstream consumers query a single table with snapshot isolation, seeing a consistent view regardless of whether data came from batch backfill or streaming update.
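The snapshot-isolation and time-travel behavior can be modeled with a toy versioned table. This is a stand-in for a real lakehouse table format, not an implementation of one: each commit produces an immutable new version, so readers always see a consistent snapshot and old versions stay queryable.

```python
import copy

class SnapshotTable:
    """Toy stand-in for a transactional table.

    Each commit produces an immutable snapshot, so a reader sees one
    consistent version (snapshot isolation) and any earlier version
    remains queryable (time travel).
    """

    def __init__(self):
        self._versions = [{}]              # version 0: empty table

    def commit(self, upserts: dict) -> int:
        new = copy.deepcopy(self._versions[-1])
        new.update(upserts)                # backfill corrections and
        self._versions.append(new)         # new rows land the same way
        return len(self._versions) - 1

    def read(self, version=None) -> dict:
        v = len(self._versions) - 1 if version is None else version
        return self._versions[v]

table = SnapshotTable()
v1 = table.commit({"user:1": {"clicks": 3}})   # streaming insert
v2 = table.commit({"user:1": {"clicks": 5}})   # batch backfill correction
```

A consumer reading `table.read()` always gets the latest committed snapshot, while `table.read(v1)` still returns the pre-correction state, regardless of which path wrote each version.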
[Figure: Infrastructure Scaling Requirements. Streaming capacity must scale from 1x (normal rate) to 10-100x (replay burst).]
💡 Key Takeaways
✓ Kappa eliminates duplicate logic by using one streaming code path for both real-time and historical processing, with batch implemented as replay from the beginning of the log
✓ Requires an event log with long retention (weeks to months) and streaming infrastructure that can burst to 10x to 100x normal throughput for historical replays
✓ Unified engines abstract away the bounded versus unbounded data distinction, letting developers write jobs once that work for both batch scans and streaming updates
✓ Storage layer must support ACID transactions, time travel, and high write throughput while serving low-latency reads, which is more complex than Lambda's separated stores
📌 Examples
1To fix a bug affecting last week's metrics, reset application to checkpoint 7 days ago and replay 7 days of events through the same streaming logic
2 Micro-batch jobs update a transactional table every 5 minutes with both late-arriving events and new real-time data; consumers query with snapshot isolation
3System processing 50 terabytes daily needs streaming cluster that can scale from 1x normal rate to 100x during full reprocessing without overwhelming downstream feature stores