Flink State Architecture: Keyed State, Operators, and Backends
Two Scopes of State:
Flink organizes state into two fundamental types based on scope. Keyed state is partitioned by a key field like user_id or session_id. Each key has isolated state, meaning the fraud detection logic for user 12345 cannot accidentally access or corrupt state for user 67890. This isolation is crucial for correctness and enables parallel processing.
Operator state is scoped to a specific operator instance, not to keys. Common uses include tracking source offsets (which Kafka partition offset has been read), buffering records before a batch write, or maintaining broadcast state shared across all keys.
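To make keyed state concrete, here is a minimal sketch using the DataStream API: a KeyedProcessFunction that keeps a running total per user_id in a ValueState. The class name, field names, and the 10,000 threshold are illustrative, not part of Flink.
```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input events are (user_id, amount) pairs; output is a simple alert string.
public class PerUserSpendTracker
        extends KeyedProcessFunction<String, Tuple2<String, Double>, String> {

    // Keyed state: Flink scopes this handle to the current key automatically,
    // so user 12345's total can never see or overwrite user 67890's.
    private transient ValueState<Double> runningTotal;

    @Override
    public void open(Configuration parameters) {
        runningTotal = getRuntimeContext().getState(
                new ValueStateDescriptor<>("runningTotal", Double.class));
    }

    @Override
    public void processElement(Tuple2<String, Double> event,
                               Context ctx,
                               Collector<String> out) throws Exception {
        Double total = runningTotal.value();            // state for THIS key only
        double updated = (total == null ? 0.0 : total) + event.f1;
        runningTotal.update(updated);
        if (updated > 10_000.0) {                       // illustrative threshold
            out.collect("High spend for user " + ctx.getCurrentKey());
        }
    }
}
```
Wired up as events.keyBy(e -> e.f0).process(new PerUserSpendTracker()), every event for a given user_id reaches the same parallel instance and therefore sees the same runningTotal.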
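Operator state is usually accessed through the CheckpointedFunction interface. Below is a minimal sketch of the buffering use case just mentioned: a sink that batches records and snapshots the not-yet-written buffer as ListState so nothing is lost across restarts. The class name and the batch size of 100 are illustrative.
```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Operator state: one buffer per parallel sink instance, not per key.
public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private final List<String> buffer = new ArrayList<>();
    private transient ListState<String> checkpointedBuffer;

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);
        if (buffer.size() >= 100) {
            // Flush the batch to the external system here (omitted in this sketch).
            buffer.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Called on every checkpoint: persist whatever is still buffered.
        checkpointedBuffer.update(new ArrayList<>(buffer));
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedBuffer = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-records", String.class));
        if (ctx.isRestored()) {
            // On recovery, reload records that were buffered at checkpoint time.
            for (String value : checkpointedBuffer.get()) {
                buffer.add(value);
            }
        }
    }
}
```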
State Backends: The Storage Layer:
The state backend determines where and how state is physically stored. The heap backend keeps everything as objects in Java heap memory. Access is fastest (microseconds), but you're limited by available RAM, and garbage collection can cause latency spikes. This works well for small state sizes, under roughly 10 gigabytes per TaskManager.
The RocksDB backend uses an embedded log-structured merge-tree (LSM-tree) database that stores state on local disk, with recent data cached in memory. This supports terabytes of state per job because disk capacity is much cheaper than RAM. The tradeoff is higher latency: reads often hit disk at millisecond latencies instead of microseconds, and write amplification from LSM compaction increases I/O load.
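Choosing the backend is a one-line call on the execution environment. A minimal sketch, assuming Flink 1.13 or later where the built-in choices are HashMapStateBackend and EmbeddedRocksDBStateBackend (the latter requires the flink-statebackend-rocksdb dependency); the checkpoint URI is a placeholder.
```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendChoice {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap backend: state lives as objects on the JVM heap.
        // Fastest access, but bounded by RAM and subject to GC pauses.
        env.setStateBackend(new HashMapStateBackend());

        // RocksDB backend: state lives on local disk with an in-memory cache,
        // suitable for state far larger than the available heap.
        // env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Checkpoint storage (where snapshots go for recovery) is configured
        // separately from the backend; the URI below is a placeholder.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
    }
}
```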
Key Groups for Rescaling:
Flink doesn't map keys directly to operator instances. Instead, keys are hashed into a fixed number of key groups (equal to the job's maximum parallelism, commonly 128 or a few thousand), and key groups are assigned to operator instances. When you scale from 10 to 20 parallel instances, Flink redistributes key groups, not individual keys. This indirection keeps rescaling metadata small and redistribution efficient.
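A simplified sketch of that two-step routing (the helper methods are illustrative; Flink's actual KeyGroupRangeAssignment additionally applies a murmur hash on top of hashCode(), but the structure is the same):
```java
public final class KeyGroupSketch {

    // Key -> key group: depends only on the key and the fixed maximum parallelism.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Key group -> operator instance: depends on the current parallelism, so
    // rescaling only reassigns whole key groups between instances.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;  // number of key groups, fixed for the job
        int group = keyGroupFor("user-12345", maxParallelism);
        int before = operatorIndexFor(group, maxParallelism, 10);  // 10 instances
        int after = operatorIndexFor(group, maxParallelism, 20);   // rescaled to 20
        System.out.println("key group " + group
                + " -> instance " + before + " before, " + after + " after rescale");
    }
}
```
Because the key-to-group step ignores the current parallelism, a key's group never changes; only the group-to-instance mapping is recomputed when the job rescales.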
Why This Matters at Scale:
With 100 million users generating events, you cannot fit all state on one machine. Keyed state solves this by distributing state across the cluster. If you run a job with 50 parallel operator instances, the key space is divided into 50 partitions. Events with user_id 12345 always route to the same partition, ensuring that machine has the complete history for that user.
The math is straightforward. If each user's state averages 10 kilobytes (recent transactions, computed features, session data), 100 million users means 1 terabyte total. Distributed across 50 machines, that's 20 gigabytes per machine, easily manageable in memory or on local SSD.
State Distribution Example: 1 TB total state ÷ 50 machines = 20 GB per machine.
⚠️ Common Pitfall: Choosing the heap backend for production without measuring actual state size. A job that looks fine in testing with 1 million users can exhaust heap memory when deployed against 100 million users, causing out-of-memory crashes.
💡 Key Takeaways
✓ Keyed state is partitioned by key (like user_id), ensuring each key's state lives on one machine for consistent access and enabling linear scaling across the cluster
✓ Operator state is per instance, used for source offsets, buffers, and broadcast data shared across all keys processed by that operator
✓ The heap backend offers microsecond access but limited capacity (under roughly 10 GB per TaskManager), while RocksDB supports terabytes with millisecond latency and higher I/O overhead
✓ Key groups provide an indirection layer between keys and operator instances, enabling efficient rescaling by redistributing groups instead of individual keys
✓ State size planning is critical: 100 million users at 10 KB per user equals 1 TB total, requiring capacity planning for both the working set in memory and checkpoint storage
📌 Examples
1. Fraud detection maintaining a ValueState per credit card with recent transaction amounts and locations, accessed in microseconds when new transactions arrive
2. Session window aggregation keeping MapState per user_id with event timestamps, where each entry represents an event in the current session window
3. Kafka source operator using ListState (operator state) to track committed offsets for each partition, ensuring exactly-once processing across restarts