State Management and Checkpointing at Scale
State Storage Architecture:
Each processing task maintains local state in an embedded key-value store, typically RocksDB, backed by local Solid State Drives (SSDs). Above this sits an in-memory cache for hot keys. This two-tier approach provides 1 to 3 millisecond read and write latency without network calls. For a system processing 1 million events per second across 100 tasks, each task might hold 5 to 20 gigabytes of physical state on disk, with several hundred megabytes cached in memory.
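As a rough sketch of the two-tier layout (the class name, cache size, and the use of Python's dbm module as a stand-in for RocksDB are illustrative, not taken from any particular engine):

```python
from collections import OrderedDict
import dbm  # stand-in for an embedded store such as RocksDB

class LocalStateStore:
    """Two-tier state: an in-memory cache for hot keys over a local on-disk store."""

    def __init__(self, path, cache_size=100_000):
        self.disk = dbm.open(path, "c")   # embedded key-value store on local SSD
        self.cache = OrderedDict()        # hot-key cache, bounded by cache_size
        self.cache_size = cache_size

    def get(self, key):
        if key in self.cache:             # cache hit: no disk or network access
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.disk.get(key)        # cache miss: read from the local store
        if value is not None:
            self._cache_put(key, value)
        return value

    def put(self, key, value):
        self.disk[key] = value            # write through to the on-disk store
        self._cache_put(key, value)

    def _cache_put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict the least recently used key
```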
The total logical state can be enormous. A user behavior pipeline at a large consumer company might maintain 30 days of per-user aggregates across 100 million users. At 1 kilobyte per user, that is 100 gigabytes of state, spread across hundreds of machines. The key insight is that state is partitioned by key: events for user 12345 always route to the same task, which holds only that user's state locally.
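The routing itself is usually just a hash of the key modulo the task count; a minimal sketch, assuming the 100-task parallelism from the example above (the function name is illustrative):

```python
import hashlib

NUM_TASKS = 100  # parallelism from the example above

def task_for_key(user_id: str, num_tasks: int = NUM_TASKS) -> int:
    """Deterministically route a key to one task, so only that task holds its state."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_tasks

# Events for user 12345 always land on the same task, wherever they are produced.
assert task_for_key("12345") == task_for_key("12345")
```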
The Checkpointing Mechanism:
Checkpointing ties input offsets to state snapshots. At checkpoint time, the system picks a consistent set of input positions across all partitions, flushes in-memory state to the local store, and records metadata mapping partitions and offsets to state files.
Here is what happens during a checkpoint:
1. Barrier injection: The coordinator injects checkpoint barriers into the input streams at consistent positions across all partitions.
2. State flush: Each task flushes in-memory state to local disk when it receives the barrier, ensuring durability.
3. Metadata recording: The system records input offsets and state file locations, creating a consistent recovery point.
4. Acknowledgment: Tasks acknowledge checkpoint completion, and the coordinator marks the checkpoint as successful.
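To make these four steps concrete, here is a simplified single-process sketch; the Task, Coordinator, and CheckpointMetadata classes are invented for illustration, and a real engine performs this across many machines while overlapping checkpoints with normal processing:

```python
from dataclasses import dataclass

@dataclass
class CheckpointMetadata:
    checkpoint_id: int
    input_offsets: dict   # partition -> offset at the barrier position
    state_files: dict     # task_id -> location of the flushed state snapshot
    completed: bool = False

class Task:
    """One processing task with local state and a tracked input offset."""
    def __init__(self, task_id: int, partition: str):
        self.task_id, self.partition = task_id, partition
        self.offset, self.state = 0, {}

    def process(self, key, value):
        self.offset += 1
        self.state[key] = value

    def flush_state(self, checkpoint_id: int) -> str:
        # Step 2, state flush: persist in-memory state to local disk (simulated as a path here).
        return f"/local/state/task-{self.task_id}/checkpoint-{checkpoint_id}"

    def ack(self, checkpoint_id: int) -> bool:
        # Step 4, acknowledgment: report completion back to the coordinator.
        return True

class Coordinator:
    def __init__(self, tasks):
        self.tasks, self.next_id = tasks, 1

    def trigger_checkpoint(self) -> CheckpointMetadata:
        cp_id, offsets, files = self.next_id, {}, {}
        self.next_id += 1
        for task in self.tasks:
            # Step 1, barrier injection: capture each task's input position at the barrier.
            offsets[task.partition] = task.offset
            files[task.task_id] = task.flush_state(cp_id)
        # Step 3, metadata recording: offsets plus state-file locations form one recovery point.
        meta = CheckpointMetadata(cp_id, offsets, files)
        meta.completed = all(task.ack(cp_id) for task in self.tasks)
        return meta

tasks = [Task(i, f"partition-{i}") for i in range(3)]
coordinator = Coordinator(tasks)
tasks[0].process("user-12345", {"clicks": 7})
print(coordinator.trigger_checkpoint())
```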
Checkpoint frequency is a trade-off. Frequent checkpoints (every 10 seconds) mean fast recovery: if a task fails, it only replays 10 seconds of input. But frequent checkpoints add overhead: flushing gigabytes of state to disk every 10 seconds can saturate I/O. Less frequent checkpoints (every 5 minutes) reduce overhead but increase recovery time: a failure means replaying 5 minutes of events.
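A back-of-the-envelope way to reason about this trade-off, assuming purely hypothetical input and replay rates:

```python
def worst_case_replay_seconds(checkpoint_interval_s, input_rate_eps, replay_rate_eps):
    """A failure just before the next checkpoint replays a full interval of input."""
    events_to_replay = checkpoint_interval_s * input_rate_eps
    return events_to_replay / replay_rate_eps

# Assuming 1M events/sec of input and a hypothetical replay speed of 5M events/sec:
print(worst_case_replay_seconds(10, 1_000_000, 5_000_000))    # 2.0 s for 10-second checkpoints
print(worst_case_replay_seconds(300, 1_000_000, 5_000_000))   # 60.0 s for 5-minute checkpoints
```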
Changelogs and Recovery:
Many systems also write state changes to a distributed log called a changelog. Every update to local state gets appended to this log. On recovery, a task can restore state by replaying the changelog instead of reprocessing the entire input stream. This is faster: replaying a changelog is a sequential read of compact state updates, while replaying input might mean filtering and aggregating millions of events.
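A minimal sketch of the idea, using a plain Python list as a stand-in for a distributed log partition (the class and method names are illustrative):

```python
class ChangelogBackedState:
    """Local state whose every update is also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog   # a plain list standing in for a distributed log partition
        self.state = {}

    def put(self, key, value):
        self.state[key] = value
        self.changelog.append((key, value))   # each local update is logged durably

    @classmethod
    def restore(cls, changelog):
        # Recovery replays compact state updates sequentially instead of the raw input stream.
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value
        return store

log = []
primary = ChangelogBackedState(log)
primary.put("user-12345", {"clicks": 7})
recovered = ChangelogBackedState.restore(log)   # rebuilds state from the changelog
```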
For high availability, some systems maintain warm standbys that continuously consume the changelog, keeping a near-real-time replica of state. When the primary fails, the standby can take over in seconds with minimal replay, achieving recovery times under 10 seconds instead of minutes.
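A warm standby can be sketched the same way: it keeps applying the changelog in the background so only a small tail remains at failover (again, the class and method names are illustrative):

```python
class WarmStandby:
    """Continuously applies changelog records so failover needs only a small catch-up."""

    def __init__(self, changelog):
        self.changelog = changelog
        self.applied = 0      # position in the changelog that has been applied so far
        self.state = {}

    def catch_up(self):
        # Runs continuously in the background; at failover only a short tail remains.
        while self.applied < len(self.changelog):
            key, value = self.changelog[self.applied]
            self.state[key] = value
            self.applied += 1

    def promote(self):
        self.catch_up()       # apply any remaining tail, then serve as the new primary
        return self.state
```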
Checkpoint frequency trade-off: every 10 seconds favors fast recovery; every 5 minutes favors lower overhead.
⚠️ Common Pitfall: If state grows unbounded (for example, keeping per-user data forever without Time-To-Live (TTL) policies), checkpoints take longer, recovery slows down, and disk fills up. A system that worked fine at 10 million users can fail catastrophically at 100 million users.
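A minimal sketch of a per-key TTL policy, with lazy expiry on read plus a periodic sweep; the class name and expiry behavior are illustrative rather than taken from any specific framework:

```python
import time

class TtlState:
    """Per-key state with a Time-To-Live so entries for inactive keys eventually expire."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}     # key -> (value, last_update_timestamp)

    def put(self, key, value):
        self.entries[key] = (value, time.time())

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, updated_at = item
        if time.time() - updated_at > self.ttl:
            del self.entries[key]   # lazily expire stale state on access
            return None
        return value

    def sweep(self):
        # Periodic cleanup so stale keys that are never read again still get removed.
        now = time.time()
        for key in [k for k, (_, t) in self.entries.items() if now - t > self.ttl]:
            del self.entries[key]
```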
💡 Key Takeaways
✓ State is stored locally in embedded key-value stores (typically RocksDB) on local SSDs, with in-memory caches providing 1 to 3 millisecond access times
✓ Checkpoints create consistent snapshots by injecting barriers into input streams, flushing state to disk, and recording metadata that ties input offsets to state files
✓ Checkpoint frequency is a trade-off: every 10 seconds enables fast recovery but adds I/O overhead; every 5 minutes reduces overhead but increases recovery time from seconds to minutes
✓ Changelogs (distributed logs of state updates) enable faster recovery than replaying input, and warm standbys that consume changelogs can take over in under 10 seconds
✓ State must have retention policies (Time-To-Live limits) or it grows unbounded, eventually causing checkpoint failures, slow recovery, and disk exhaustion as user counts grow from millions to hundreds of millions
📌 Examples
1. A system processing 1 million events/sec across 100 tasks holds 5 to 20 GB of state per task, checkpoints every 3 minutes, and recovers from failure in under 2 minutes
2. A payment fraud system maintains 24-hour rolling aggregates with a 1-minute Time-To-Live on old windows, preventing state from growing beyond a bounded memory footprint
3. A high availability setup uses warm standbys consuming the changelog, achieving failover in under 10 seconds when the primary task fails