Apache Flink Architecture & State Management

Production Implementation: Architecture and Operations

Control Plane and Data Plane Separation: Flink's architecture separates orchestration from execution. The JobManager is the control plane: it receives job submissions, converts logical dataflows into physical execution graphs, schedules tasks to TaskManagers, and coordinates checkpoints. It does not process data. TaskManagers are the data plane. Each TaskManager runs multiple task slots, which are the units of parallelism. A TaskManager with 8 CPU cores might be configured with 4 slots, meaning it can execute 4 parallel tasks concurrently. When you submit a job with parallelism 50, the JobManager distributes those 50 tasks across available slots in the cluster.

Key Groups and Rescaling: Flink uses key groups as an indirection layer for state partitioning. Instead of mapping keys directly to operator instances, keys are hashed into a fixed number of key groups (configured at job submission, often 128 or in the thousands). These key groups are then assigned to operator instances. When you rescale from parallelism 10 to 20, Flink redistributes key groups, not individual keys. This keeps metadata small and makes rescaling efficient. Each operator instance reads the key groups assigned to it from the previous savepoint, and the system automatically redistributes keys as needed. For example, with 1,280 key groups and parallelism 10, each instance handles 128 key groups. Scaling to 20 means 64 key groups per instance. Flink reads state from the savepoint and redistributes it, a process that might take minutes for terabytes of state but does not require rewriting the entire savepoint.

Checkpoint Mechanics at Scale: Modern deployments use incremental checkpoints with RocksDB. Instead of uploading all state on every checkpoint, Flink uses RocksDB's built-in snapshot feature to identify the files that changed since the last checkpoint. Only these changed files (often called delta files) are uploaded to S3 or HDFS. For a job with 500 gigabytes of state where 50 gigabytes changed since the last checkpoint, incremental mode uploads 50 gigabytes instead of 500. This reduces checkpoint duration from 10 minutes to under 1 minute, making frequent checkpoints (every 30 to 60 seconds) feasible even at large state sizes. Checkpoints are also asynchronous: when a TaskManager receives a barrier, it triggers a local snapshot (using copy-on-write or a filesystem snapshot) and immediately continues processing. The actual upload to remote storage happens in background threads, not in the task threads. This keeps checkpoint overhead to roughly 5 to 10 percent of CPU capacity.

Savepoints for Blue-Green Deployments: Savepoints use the same mechanism as checkpoints but are triggered explicitly by an operator and are guaranteed to be compatible across Flink versions (within limits). When you need to upgrade job logic or Flink itself, you trigger a savepoint, stop the job, deploy the new version, and restore from the savepoint. This preserves all state: per-user aggregates, window contents, source offsets. The new version resumes processing exactly where the old version stopped. This is critical for stateful jobs where restarting from scratch would lose hours or days of accumulated state. In practice, you test the new version in a staging environment with a copy of production state, validate correctness, then execute the same process in production with minimal downtime (often under 1 minute).

Sink Two-Phase Commit: To achieve end-to-end exactly-once semantics, sinks must participate in Flink's checkpoint protocol. The sink connector prepares a batch of writes (to a database, message queue, or filesystem) and tags them with a checkpoint identifier, but does not commit them yet. When the sink receives confirmation that checkpoint N completed, it commits all writes associated with checkpoint N. On recovery, any writes associated with incomplete checkpoints are rolled back or discarded. This ensures that downstream systems see each event exactly once, even across failures. Not all systems support this: Elasticsearch, for example, has no transactions that can be aligned with an external coordinator. Writing to Elasticsearch from Flink is therefore at-least-once: on recovery, some events may be written again, creating duplicates. The application must handle deduplication downstream or accept approximate counts.
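To make the key-group arithmetic above concrete, here is a minimal plain-Java sketch of the mapping. It mirrors the logic of Flink's internal key-group assignment (hash a key into one of maxParallelism key groups, then map contiguous ranges of key groups onto operator instances), but the class and method names are illustrative, not Flink's API, and the murmur-hash step Flink applies on top of hashCode() is omitted for brevity.

```java
// Illustrative sketch of Flink-style key-group assignment (not Flink's actual classes).
public final class KeyGroupSketch {

    // A key is hashed into one of `maxParallelism` key groups (e.g. 1,280).
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Key groups are assigned to operator instances in contiguous ranges.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 1_280;   // number of key groups, fixed at job submission
        int keyGroup = keyGroupFor("user-42", maxParallelism);

        // Rescaling from 10 to 20 only changes which contiguous range of key groups
        // each instance owns; the key -> key-group mapping never changes.
        System.out.printf("key group %d -> instance %d at parallelism 10%n",
                keyGroup, operatorIndexFor(keyGroup, maxParallelism, 10));
        System.out.printf("key group %d -> instance %d at parallelism 20%n",
                keyGroup, operatorIndexFor(keyGroup, maxParallelism, 20));
    }
}
```

With 1,280 key groups, each instance owns a range of 128 key groups at parallelism 10 and 64 at parallelism 20, matching the figures above.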
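Enabling incremental RocksDB checkpoints and a frequent checkpoint interval is a small amount of job setup. The sketch below assumes a Flink 1.x API (roughly 1.13+); package names have shifted slightly across versions, the S3 bucket path is illustrative, and the CheckpointSetup class is just a placeholder for real job code.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(50);  // 50 parallel tasks spread across the cluster's task slots

        // RocksDB state backend with incremental checkpoints: only changed SST files
        // are uploaded on each checkpoint instead of the full state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Frequent checkpoints become feasible once uploads are incremental.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig cc = env.getCheckpointConfig();
        cc.setCheckpointStorage("s3://my-bucket/flink/checkpoints");  // illustrative bucket
        cc.setMinPauseBetweenCheckpoints(30_000);  // breathing room between checkpoints

        // ... build the actual pipeline here, then:
        // env.execute("my-job");
    }
}
```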
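The two-phase-commit protocol a transactional sink follows can be summarized in a handful of methods. The sketch below is a plain-Java illustration of the lifecycle (stage on checkpoint, commit on checkpoint-complete notification, discard on recovery), not Flink's actual sink interface; Flink ships this pattern in classes such as TwoPhaseCommitSinkFunction and its exactly-once Kafka sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java illustration of the two-phase-commit sink lifecycle (not Flink's API).
public final class TwoPhaseCommitLifecycle {

    private final List<String> openBatch = new ArrayList<>();
    private final Map<Long, List<String>> pendingByCheckpoint = new ConcurrentHashMap<>();

    // Buffer writes inside the current (uncommitted) transaction.
    void write(String record) {
        openBatch.add(record);
    }

    // Phase 1 (pre-commit): on checkpoint barrier N, flush and stage the batch under
    // the checkpoint id, but do NOT make it visible downstream yet.
    void onCheckpoint(long checkpointId) {
        pendingByCheckpoint.put(checkpointId, new ArrayList<>(openBatch));
        openBatch.clear();
    }

    // Phase 2 (commit): only after the JobManager confirms checkpoint N completed
    // is the staged batch committed, so downstream sees each event exactly once.
    void onCheckpointComplete(long checkpointId) {
        List<String> batch = pendingByCheckpoint.remove(checkpointId);
        if (batch != null) {
            // e.g. COMMIT a database transaction or a Kafka transactional batch here
            System.out.println("committed " + batch.size() + " records for checkpoint " + checkpointId);
        }
    }

    // Recovery: transactions for checkpoints that never completed are rolled back or discarded.
    void onRecovery() {
        pendingByCheckpoint.clear();
        openBatch.clear();
    }
}
```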
✓ In Practice: Production teams often run Flink on Kubernetes with autoscaling based on CPU and memory metrics. However, scaling stateful jobs is expensive (requires redistributing state), so autoscaling is typically conservative with long cooldown periods (10 to 30 minutes) to avoid thrashing.
Monitoring and Operational Metrics: Key metrics for production Flink jobs include checkpoint duration (should be under 10 percent of checkpoint interval), backpressure percentage (ideally under 10 percent), busy time per operator (to identify bottlenecks), and lag behind the source (for Kafka, the difference between latest offset and committed offset). When checkpoint duration exceeds 50 percent of the interval, you're at risk of the checkpoint death spiral. When backpressure is consistently above 20 percent, your pipeline capacity does not match input rate. These are early warning signs that require intervention before the job fails.
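As a rough illustration of the thresholds above, the following Java sketch encodes them as simple predicates. The class and method names are hypothetical, the thresholds are the rules of thumb from this section rather than universal constants, and the metric values would in practice come from Flink's REST API or a metrics reporter such as Prometheus.

```java
import java.time.Duration;

// Rule-of-thumb health checks for a production Flink job (illustrative thresholds).
public final class FlinkHealthChecks {

    // Checkpoint duration past ~50% of the interval risks a checkpoint death spiral:
    // checkpoints overlap with normal processing and fall progressively further behind.
    static boolean checkpointAtRisk(Duration avgCheckpointDuration, Duration checkpointInterval) {
        return avgCheckpointDuration.toMillis() > checkpointInterval.toMillis() / 2;
    }

    // Sustained backpressure above ~20% means pipeline capacity does not match input rate.
    static boolean underProvisioned(double backpressuredTimeRatio) {
        return backpressuredTimeRatio > 0.20;
    }

    // Growing Kafka lag (latest offset minus committed offset) means the job is falling behind.
    static boolean fallingBehind(long previousLag, long currentLag) {
        return currentLag > previousLag;
    }

    public static void main(String[] args) {
        System.out.println(checkpointAtRisk(Duration.ofSeconds(35), Duration.ofSeconds(60))); // true
        System.out.println(underProvisioned(0.12));                                           // false
    }
}
```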
💡 Key Takeaways
JobManager coordinates scheduling and checkpoints while TaskManagers execute tasks; task slots are units of parallelism, with typical TaskManagers running 4 to 8 slots per machine
Key groups provide indirection between keys and operator instances, enabling rescaling by redistributing groups instead of individual keys, keeping metadata overhead low even with billions of keys
Incremental checkpoints upload only changed RocksDB files (50 GB delta vs 500 GB full state), reducing checkpoint durations from 10 minutes to under 1 minute at large scale
Savepoints enable blue-green deployments by preserving all state across job restarts, allowing version upgrades with under a minute of downtime and no state loss
Two-phase-commit sinks prepare writes with checkpoint IDs, then commit only after checkpoint completion, providing exactly-once semantics but requiring transaction support in the external system
📌 Examples
1. A job with parallelism 50 and 1,280 key groups scales to parallelism 100 by redistributing key groups so each of the 100 instances owns 12 or 13 of them (instead of 25 or 26), reading state from the savepoint in under 5 minutes
2. A production deployment on Kubernetes with 20 TaskManagers, each running 4 slots, processes 1 million events per second with incremental checkpoints every 60 seconds uploading 30 GB deltas to S3
3. A blue-green deployment where a fraud detection job is upgraded by taking a savepoint at 2 AM, deploying the new version, and restoring from the savepoint with 45 seconds of downtime while preserving 800 GB of per-user state