
Stateful Stream Processing: Local State, Checkpoints, and Fault Tolerance

Stateful stream processing maintains per-key or per-operator state co-located with the compute that processes those keys. This enables aggregations, joins, deduplication, and temporal pattern detection without remote database lookups. An embedded key-value store keeps state local, achieving operator latencies of 10 to 100 milliseconds even with hundreds of gigabytes of state per task. State is partitioned alongside the data stream, so all events for user_123 and all state for user_123 live on the same worker.

Fault tolerance comes from periodic consistent snapshots called checkpoints. The system atomically captures in-flight stream positions (offsets) plus a snapshot of all operator state, writing both to durable storage. On failure, jobs restart from the last completed checkpoint and replay from those offsets. Typical production checkpoint intervals are 30 to 300 seconds. State is also backed by durable changelogs (compacted Kafka topics in Kafka Streams, incremental RocksDB snapshots in Flink), enabling fast local recovery by replaying recent changes instead of downloading full snapshots.

Exactly-once end-to-end semantics require coupling checkpoint barriers with transactional or idempotent sink writes. The checkpoint barrier flows through the pipeline; only after all operators snapshot their state and the sink transaction commits does the checkpoint complete. Without transactional sinks, you have at-least-once delivery and must design idempotent handlers with deduplication keys. Two-phase commit adds latency and can reduce throughput under contention, so many production systems accept at-least-once delivery to fast sinks like metrics or notifications and handle duplicates downstream.
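The sketch below illustrates these ideas with Flink's DataStream API (assuming roughly Flink 1.13+ with the RocksDB state backend on the classpath): keyed state held in an embedded store, periodic exactly-once checkpoints, and durable checkpoint storage. The job, socket source, and S3 path are illustrative, not taken from any particular production setup.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class KeyedCountJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s with exactly-once barriers (within the 30-300 s range above).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
        // Durable checkpoint storage decoupled from compute (path is hypothetical).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");
        // Embedded RocksDB keeps state local; 'true' enables incremental snapshots.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        env.socketTextStream("localhost", 9999)   // stand-in source of user IDs
            .keyBy(userId -> userId)              // all events and state for user_123 land on one task
            .process(new CountPerKey())
            .print();

        env.execute("keyed-count");
    }

    /** Keyed state lives next to the operator; no remote database lookup per event. */
    static class CountPerKey extends KeyedProcessFunction<String, String, String> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            long next = (count.value() == null ? 0L : count.value()) + 1;
            count.update(next);                   // snapshotted atomically at each checkpoint
            out.collect(ctx.getCurrentKey() + " -> " + next);
        }
    }
}
```

Incremental RocksDB snapshots matter once state grows large: each checkpoint uploads only changed SST files instead of the full store, which keeps checkpoint duration within the interval.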
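When the sink is not transactional, the at-least-once path instead relies on idempotent writes keyed by a deduplication ID, so a replay after recovery cannot double-count. A minimal sketch, assuming a PostgreSQL table with a unique event_id column; the connection URL, table, and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWriter {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/metrics")) {
            // event_id is the deduplication key; a replayed event after checkpoint
            // recovery hits the conflict clause and is dropped instead of double-counted.
            String upsert =
                "INSERT INTO page_views (event_id, user_id, viewed_at) " +
                "VALUES (?, ?, ?) ON CONFLICT (event_id) DO NOTHING";
            try (PreparedStatement stmt = conn.prepareStatement(upsert)) {
                stmt.setString(1, "evt-42");          // unique event ID carried on the stream
                stmt.setString(2, "user_123");
                stmt.setTimestamp(3, new java.sql.Timestamp(System.currentTimeMillis()));
                stmt.executeUpdate();
            }
        }
    }
}
```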
💡 Key Takeaways
Local state co-location eliminates remote lookups; typical production jobs maintain hundreds of gigabytes to multi-terabyte state per job with p99 operator latency of 10 to 100 milliseconds
Checkpoints provide fault tolerance by snapshotting stream positions and operator state every 30 to 300 seconds; recovery replays from last checkpoint, trading frequency against overhead
Exactly-once end-to-end requires transactional sinks or two-phase commit; otherwise use at-least-once delivery with idempotent writes and deduplication keys such as event IDs (see the sketch after this list)
Large state slows checkpoints and recovery; a job with 100 GB state per task can take 1 to 30 seconds to checkpoint and minutes to recover, accruing consumer lag during downtime
Changelogs enable fast recovery by replaying recent changes instead of downloading full snapshots; Kafka Streams uses compacted topics, Flink uses RocksDB incremental snapshots
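A minimal Kafka Streams sketch tying the last two takeaways together: processing.guarantee=exactly_once_v2 turns on transactional producers and read-committed consumers, and the local count store is mirrored to a compacted changelog topic so a restarted task replays recent changes rather than rebuilding from scratch. The application ID and topic names are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ExactlyOnceCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-counts");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Transactional writes within Kafka; non-Kafka sinks still need idempotent handlers.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("user-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               // Local RocksDB store; Kafka Streams mirrors it to a compacted
               // changelog topic for recovery of restarted or migrated tasks.
               .count(Materialized.as("event-counts"))
               .toStream()
               .to("user-event-counts-out", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```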
📌 Examples
OpenAI's Flink platform stores durable state in blob storage decoupled from compute; jobs automatically restart on healthy clusters during outages with checkpoint intervals tuned to balance recovery time and overhead
Amazon Kinesis Data Analytics targets checkpoint durations under 30 seconds at p99 for production workloads processing billions of events per hour, using incremental checkpoints for large state