Stream Processing Architectures • Apache Flink Architecture & State ManagementEasy⏱️ ~3 min
What is Apache Flink? Understanding Stateful Stream Processing
Definition
Apache Flink is a distributed stream processing engine that processes unbounded data streams in real time while maintaining large amounts of consistent state across a cluster of machines.
user_id or account_id) to specific machines, and each machine maintains state only for its assigned keys. With 100 million users and 50 machines, each machine handles roughly 2 million user states.
The Critical Feature:
What makes this production ready is Flink's checkpoint mechanism. Every 30 to 300 seconds, the system takes a consistent snapshot of all state across all machines and writes it to durable storage like HDFS or S3. If a machine fails, Flink restarts the affected tasks on healthy machines and restores their state from the latest checkpoint. You get both speed and reliability.
This architecture enables applications that need both history and freshness: real time recommendation features, session analytics, per user rate limiting, or anomaly detection where sub second latency matters.💡 Key Takeaways
✓Flink processes unbounded event streams in real time, maintaining state locally on TaskManagers for microsecond access instead of millisecond database round trips
✓State is partitioned by keys like <code>user_id</code> across machines, enabling horizontal scaling to billions of events per day while preserving per key consistency
✓Checkpoints create consistent snapshots every 30 to 300 seconds to durable storage, enabling recovery from machine failures without losing progress
✓Unlike batch processing (minutes of delay) or stateless streaming (slow external lookups), Flink combines low latency with large state capacity
✓Production deployments at companies like Alibaba process over 10 million events per second with p99 latencies under 500 milliseconds
📌 Examples
1Fraud detection system processing 2 million credit card transactions per second, keeping recent transaction history for each card in local state to flag anomalies within 100 milliseconds
2Real time recommendation engine maintaining user behavior features across 100 million users, updating features as clickstream events arrive and serving them to ranking models with sub second freshness
3Session analytics aggregating user actions into sessions, handling out of order events and late arrivals while producing accurate counts even when mobile apps send events hours after they occurred