When to Use Stateful vs Stateless Processing
The Core Decision Framework:
The choice between stateful and stateless stream processing comes down to three questions: Do you need to look at multiple events together? What is your latency budget? What operational complexity can you handle?
Choose Stateless Processing When:
Your logic consists purely of event-level transformations. Filtering, format conversion, enrichment from static lookup tables, and routing can all be done without maintaining state across events. Stateless processing scales trivially: add more instances and you are done. There is no state to partition, checkpoint, or recover.
Stateless systems can achieve ultra-low latency. Without state lookups or checkpointing overhead, you can process 5 to 10 million events per second per core with sub-millisecond p99 latency. If your use case is high-throughput routing or transformation with no aggregation, stateless wins on simplicity and speed.
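To make "event-level" concrete, here is a minimal sketch of a stateless transform: every decision uses only the current event plus a static lookup table. The topic names, fields, and routing rule are purely illustrative, and no framework API is assumed.

```python
# Stateless event-level processing: filter, convert format, enrich from a
# static table, and route. No state survives between calls, so any instance
# can process any event.

STATIC_COUNTRY_REGION = {"US": "na", "DE": "eu", "JP": "apac"}  # static enrichment table

def process(event: dict) -> tuple[str, dict] | None:
    """Return (output_topic, transformed_event), or None to drop the event."""
    if event.get("type") != "purchase":          # filter: keep purchases only
        return None
    region = STATIC_COUNTRY_REGION.get(event.get("country", ""), "other")
    out = {                                      # format conversion + enrichment
        "user_id": event["user_id"],
        "amount_cents": int(round(event["amount"] * 100)),
        "region": region,
    }
    return f"purchases.{region}", out            # route by region

# process({"type": "purchase", "user_id": "u1", "amount": 12.5, "country": "DE"})
# -> ("purchases.eu", {"user_id": "u1", "amount_cents": 1250, "region": "eu"})
```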
You can also pair stateless processing with an external store. For example, a payment service might use stateless stream processors to ingest events and query a Redis cluster for user risk scores. This works if the external store can handle the query load (often 10 to 50 thousand queries per second per node) and if you can tolerate the added latency (typically a 2 to 10 millisecond network round trip).
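A sketch of that pattern, assuming a hypothetical Redis key layout (risk:&lt;user_id&gt;), a made-up host name, and an illustrative review threshold; the only library call used is redis-py's get.

```python
# Stateless processor + external store: the processor keeps no state and
# fetches a per-user risk score from Redis for every event.
import redis

r = redis.Redis(host="risk-scores.internal", port=6379)  # hypothetical host

def score_transaction(event: dict) -> dict:
    raw = r.get(f"risk:{event['user_id']}")          # network round trip, ~2-10 ms
    risk = float(raw) if raw is not None else 0.0    # default when no score exists
    return {
        **event,
        "risk_score": risk,
        "needs_review": risk > 0.7,                  # illustrative threshold
    }
```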
Choose Stateful Processing When:
You need joins, aggregations, or session logic that depends on multiple events. Computing "transactions per user per minute" or "users who viewed product A then B" requires maintaining state across events. Stateful processing gives you this with local, low-latency access (1 to 3 milliseconds) and strong consistency guarantees tied to input offsets.
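A toy sketch of the state behind "transactions per user per minute": a counter keyed by user and minute bucket. A real engine would partition this store by key, keep it local to the operator, and checkpoint it against input offsets; the plain dict here only shows why the result for one event depends on earlier events.

```python
from collections import defaultdict

class PerUserMinuteCounter:
    def __init__(self):
        self.counts = defaultdict(int)              # (user_id, minute bucket) -> count

    def on_event(self, event: dict) -> int:
        bucket = event["ts_epoch_seconds"] // 60    # tumbling 1-minute window
        key = (event["user_id"], bucket)
        self.counts[key] += 1
        return self.counts[key]                     # running count for this window

counter = PerUserMinuteCounter()
# Two events from the same user in the same minute produce counts 1 then 2.
print(counter.on_event({"user_id": "u1", "ts_epoch_seconds": 120}))  # 1
print(counter.on_event({"user_id": "u1", "ts_epoch_seconds": 150}))  # 2
```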
Your latency budget is 50 milliseconds to a few hundred milliseconds end to end. For fraud detection, you might need to decide within 100 milliseconds of receiving a transaction. Stateful processing with embedded state stores can hit this target. If your budget is 5 milliseconds or less, the overhead of checkpointing and state management might be too high, and you should consider stateless with carefully tuned external stores.
You can invest in operational complexity. Stateful systems require monitoring state size, tuning checkpoint intervals, handling key skew, and managing recovery. Teams at companies processing millions of events per second typically dedicate engineers to state management and have runbooks for scenarios like "state grew 10x due to hot keys" or "checkpoint taking 20 minutes instead of 2 minutes."
Hybrid Patterns:
Many production systems use both. A common pattern is stateless preprocessing (filtering, parsing, routing) followed by stateful aggregation. This minimizes the volume of data that hits the stateful stage, reducing state size and checkpoint overhead. For example, filter 10 million raw events per second down to 500 thousand events that need aggregation.
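A compressed sketch of that hybrid shape, with an illustrative prefilter and a per-product counter standing in for the stateful stage; the field names and filter rule are assumptions.

```python
# Hybrid pattern: a stateless filter discards most events so only the small
# fraction that needs aggregation ever reaches (and grows) the stateful stage.
from collections import defaultdict

state = defaultdict(int)                      # per-product click counts (stateful stage)

def needs_aggregation(event: dict) -> bool:   # stateless prefilter: cheap, no state
    return event.get("type") == "click" and not event.get("is_bot", False)

def pipeline(events):
    for event in events:
        if not needs_aggregation(event):      # most events stop here
            continue
        state[event["product_id"]] += 1       # only survivors touch the state store

pipeline([
    {"type": "click", "product_id": "p1"},
    {"type": "view", "product_id": "p1"},                     # filtered out
    {"type": "click", "product_id": "p1", "is_bot": True},    # filtered out
])
print(dict(state))                            # {'p1': 1}
```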
Another pattern is stateful processing for hot-path analytics (real-time dashboards, fraud detection) and batch processing for the cold path (daily reports, model training). The stateful system maintains lightweight aggregates over short windows (5 minutes to 1 hour), while batch jobs recompute from scratch over longer periods (daily, weekly). This balances freshness and cost.
The Cost Trade Off:
Stateful processing runs 24/7. You pay for compute and storage continuously. For a pipeline processing 1 million events per second with 10 terabytes of state, you might need 50 to 100 machines running constantly, costing tens of thousands of dollars per month. Batch processing concentrates cost into bursts but adds latency.
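A rough back-of-envelope for that always-on cost, taking the midpoint of the 50-to-100-machine estimate above and an assumed hourly instance price (the price is an illustration, not a figure from this lesson):

```python
machines = 75                       # midpoint of the 50-100 machine estimate
hourly_cost_per_machine = 0.50      # assumed $/hour for a state-heavy instance
hours_per_month = 24 * 30

monthly_cost = machines * hourly_cost_per_machine * hours_per_month
print(f"~${monthly_cost:,.0f} per month")   # ~$27,000: "tens of thousands" per month
```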
Decide based on business value. If detecting fraud 5 minutes faster saves millions in chargebacks, stateful processing pays for itself. If your analytics dashboard can refresh every 10 minutes instead of every second, batch might be the pragmatic choice.
Stateless + External Store: simple to operate, adds 5 to 15 ms of latency
vs.
Stateful Embedded: 1 to 3 ms local access, but complex state management
"If you can tolerate 5-to-15-minute staleness and your logic is batch-friendly, run a micro-batch job every few minutes. If you need sub-second freshness with complex aggregations, accept the operational cost of stateful streaming."
💡 Key Takeaways
✓ Choose stateless processing for event-level transformations (filtering, routing, format conversion) where you can achieve 5 to 10 million events per second with sub-millisecond latency
✓ Choose stateful processing when you need joins, aggregations, or session logic across multiple events, and your latency budget is 50 to 500 milliseconds end to end
✓ External stores (Redis, a database) add 5 to 15 milliseconds of network latency but simplify operations; embedded state stores provide 1 to 3 millisecond access but require managing checkpoints, recovery, and key skew
✓ Stateful systems run 24/7 and incur continuous compute and storage costs; batch or micro-batch jobs reduce cost but increase staleness from seconds to 5 to 15 minutes
✓ Hybrid patterns are common in production: stateless preprocessing to reduce volume before stateful aggregation, or a stateful hot path for real-time analytics combined with a batch cold path for historical analysis
📌 Examples
1. Payment routing: a stateless processor handles 10 million events/sec with 0.5 ms p99 and queries Redis for user risk scores (adds 8 ms), acceptable for non-fraud transactions
2. Fraud detection: a stateful processor maintains per-card 5-minute and 24-hour windows, computes features in under 100 ms end to end, and flags suspicious transactions in real time
3. E-commerce analytics: a stateless filter reduces 10 million events to 500K, a stateful aggregator computes per-product click-through rates over 1-hour windows, and dashboards refresh every 10 seconds