Stream Processing (Flink, Kafka Streams) · Hard · ⏱️ ~3 min

Production Patterns: Multi-Cluster HA, Exactly-Once Sinks, and Capacity Planning

High-availability stream processing across multiple Kafka clusters treats the input as a union of deduplicated sources. OpenAI's Flink platform reads from multiple primary Kafka clusters simultaneously, with custom source logic that continues across cluster failovers without manual intervention. Each event embeds a deterministic ID (for example, a hash of user ID plus timestamp plus sequence number). The earliest operator in the pipeline deduplicates on that ID using a compact expiring key-value map, or a bloom filter combined with a windowed set, achieving exactly-once semantics even when multi-primary ingestion produces duplicates. Watchdog services monitor partition changes and automatically rescale pipelines.

Exactly-once delivery to external sinks requires transactional semantics or idempotent writes. Transactional sinks such as Kafka or PostgreSQL, where writes can participate in two-phase commit, allow checkpoint barriers and sink writes to commit atomically. Non-transactional sinks (Elasticsearch, metrics stores, notification systems) require application-level idempotency: write unique event IDs alongside the data and deduplicate on read, or implement a transactional outbox at the source of truth that the stream relays. For side effects such as emails, accept at-least-once delivery and make handlers idempotent (check whether the email was already sent before dispatching).

Capacity planning works from concrete numbers. Stateless pipelines sustain 100,000+ events per second on a few dozen vCPUs with sub-10 ms operator latency; stateful jobs with windowing run roughly an order of magnitude lower, around 10,000 to 50,000 events per second on similar hardware. A 100-node mid-range cluster costs tens of thousands of dollars per month and processes tens of billions of events per day with second-level windows. Checkpoint overhead grows with state size, so keep per-task state under tens of GB where possible to hold checkpoint durations under 30 seconds at p99. Monitor consumer lag, watermark lag, and checkpoint completion rates as key Service Level Indicators (SLIs).
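A minimal sketch of the early deduplication step described above, assuming a Flink `KeyedProcessFunction` on a stream keyed by the deterministic event ID; the generic record type, the one-hour TTL, and the state name are illustrative assumptions, not the platform's actual implementation.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Early deduplication step: the stream is keyed by the deterministic event ID
// (e.g. hash of user ID + timestamp + sequence number), so copies of the same
// event arriving from different primary clusters land on the same key.
public class DedupFunction<T> extends KeyedProcessFunction<String, T, T> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        // Expire the per-ID flag after one hour so dedup state stays compact
        // (the 1-hour TTL is illustrative, matching the window described above).
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .build();
        ValueStateDescriptor<Boolean> descriptor =
                new ValueStateDescriptor<>("seen", Boolean.class);
        descriptor.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(T event, Context ctx, Collector<T> out) throws Exception {
        // Forward only the first copy of each event ID; later copies are duplicates
        // produced by multi-primary ingestion and are dropped.
        if (seen.value() == null) {
            seen.update(true);
            out.collect(event);
        }
    }
}
```

Applied immediately after the union of the cluster sources, for example `events.keyBy(keySelector).process(new DedupFunction<>())`, where `keySelector` extracts the embedded event ID.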
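For side effects like email, the pattern above accepts at-least-once delivery and pushes deduplication into the handler. A minimal sketch under that assumption; `SentLog` and `Mailer` are hypothetical interfaces standing in for a durable record of dispatched notifications and the actual mail client.

```java
// Idempotent notification handler for at-least-once delivery. SentLog and Mailer
// are assumed interfaces, not real library APIs.
public class EmailHandler {

    public interface SentLog {
        boolean contains(String eventId);   // has this event already produced an email?
        void record(String eventId);        // durably remember that it has now
    }

    public interface Mailer {
        void send(String recipient, String body);
    }

    private final SentLog sentLog;
    private final Mailer mailer;

    public EmailHandler(SentLog sentLog, Mailer mailer) {
        this.sentLog = sentLog;
        this.mailer = mailer;
    }

    public void handle(String eventId, String recipient, String body) {
        // Check whether this email was already sent before dispatching.
        if (sentLog.contains(eventId)) {
            return; // redelivered event: the email already went out
        }
        mailer.send(recipient, body);
        sentLog.record(eventId);
        // A crash between send() and record() can still yield one duplicate email;
        // the check bounds duplicates rather than eliminating them for external effects.
    }
}
```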
💡 Key Takeaways
Multi-cluster high availability ingests from multiple Kafka primaries simultaneously, deduplicating via deterministic event IDs with bloom filters plus windowed key-value maps (roughly 500 MB of state with a 1-hour Time to Live is typical)
Exactly-once delivery to external sinks requires transactional sinks (Kafka, PostgreSQL with two-phase commit) or application-level idempotency with unique event IDs; side effects such as emails need handler idempotency checks
Capacity planning baselines: stateless pipelines at 100k+ events per second on a few dozen vCPUs; stateful windowed jobs at 10k to 50k events per second; a 100-node cluster processes tens of billions of events per day at tens of thousands of dollars per month
Keep per-task state under tens of GB to maintain checkpoint durations under 30 seconds at p99; state growth directly increases checkpoint time and recovery debt on failure (see the configuration sketch after this list)
Key Service Level Indicators (SLIs) include consumer lag (offset delta), watermark lag (event-time delta), checkpoint completion rate (successful checkpoints per interval), and end-to-end latency (event time to sink write)
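A sketch of checkpoint settings aligned with the targets above, assuming Flink with the RocksDB state backend; the specific interval, timeout, and pause values are illustrative choices, not figures from the source.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class CheckpointSetup {

    // Applies checkpoint settings aimed at keeping p99 checkpoint duration under ~30 s
    // as keyed state approaches the tens-of-GB range.
    public static void configure(StreamExecutionEnvironment env) {
        // Incremental RocksDB checkpoints upload only changed state, so checkpoint
        // duration tracks state churn rather than total state size.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Exactly-once checkpoints every 60 s (illustrative interval).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Abort a checkpoint that runs far past the 30 s target instead of stalling the job.
        env.getCheckpointConfig().setCheckpointTimeout(120_000);

        // Guarantee processing headroom between consecutive checkpoints.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // Consumer lag, watermark lag, and checkpoint completion rate are exposed via
        // Flink's metrics system and tracked externally as SLIs.
    }

    private CheckpointSetup() {}
}
```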
📌 Examples
OpenAI's Flink platform reads from multiple Kafka clusters with embedded event IDs, deduplicates with a 1-hour windowed set, and automatically rescales when partitions change, achieving continuous operation through cluster failovers
A metrics aggregation pipeline writes to a non-transactional time-series database using event ID plus timestamp as the unique key, allowing at-least-once delivery with idempotent upserts and deduplication on query
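A sketch of the idempotent upsert from the second example, assuming the time-series store accepts SQL-style upserts over JDBC; the table name, columns, and PostgreSQL-flavored ON CONFLICT syntax are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Writes one metric row keyed on (event_id, event_time); replaying the same event
// overwrites the row with identical values, so at-least-once delivery is safe.
public class MetricsUpsertWriter {

    // Illustrative table and columns; ON CONFLICT assumes a store with SQL upsert support.
    private static final String UPSERT_SQL =
            "INSERT INTO metrics (event_id, event_time, value) VALUES (?, ?, ?) " +
            "ON CONFLICT (event_id, event_time) DO UPDATE SET value = EXCLUDED.value";

    private final Connection connection;

    public MetricsUpsertWriter(Connection connection) {
        this.connection = connection;
    }

    public void write(String eventId, long eventTimeMillis, double value) throws SQLException {
        try (PreparedStatement stmt = connection.prepareStatement(UPSERT_SQL)) {
            stmt.setString(1, eventId);
            stmt.setLong(2, eventTimeMillis);
            stmt.setDouble(3, value);
            stmt.executeUpdate();
        }
    }
}
```

Because the write is keyed on (event_id, event_time), redelivered events simply rewrite the same row, and readers can additionally deduplicate on query if the store keeps multiple versions.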