Message Queues & Streaming › Kafka/Event Streaming Architecture · Hard · ⏱️ ~3 min

Multi-Region Patterns: Active-Passive vs. Active-Active Replication

Multi-region event streaming architectures split into two patterns: active-passive (a single writer region with async disaster-recovery replication) and active-active (independent regional writers with conflict resolution).

Active-passive is simpler: one region handles writes, asynchronously replicating to a standby region with 50-200 ms of wide-area-network (WAN) lag. Failover is manual, with runbooks to promote the standby, typically taking 5-30 minutes. This pattern guarantees ordering and avoids conflicts, but it concentrates risk and increases latency for users far from the primary region.

Active-active lets each region write independently, improving regional autonomy and write latency for local users. Consumers read a union of sources across all primaries, so stream processors must handle duplicate events and occasional gaps during failovers. OpenAI built a union-of-sources connector pattern with watchdog services that detect topology changes (new partitions, new primaries) and auto-scale stream jobs. They explicitly accept non-exactly-once semantics, handling duplicates downstream with idempotency keys and natural deduplication in sinks.

The failure modes differ sharply. Active-passive risks extended unavailability if the primary region fails and manual failover is slow or bungled. Active-active risks subtle data-quality issues: duplicate events from overlapping unions, out-of-order delivery when primaries have clock skew, and gaps when a primary's replication lags. Cross-region network partitions in active-active can create split-brain scenarios where both regions accept writes for the same key with different values, requiring last-write-wins with vector clocks or domain-specific merge logic.

In practice, choose active-passive for correctness-critical domains (financial transactions, orders) where global ordering and deterministic replay matter more than regional write latency. Choose active-active for availability-critical, idempotent workloads (logging, metrics, operational telemetry) where regional autonomy and fast local writes outweigh occasional duplicates. Either way, monitor per-source offsets, replication lag, and watermark completeness across all primaries to detect drift and gaps.
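The union-of-sources consumption described above can be sketched as a merge that drops duplicates by idempotency key. This is a minimal in-memory illustration, not OpenAI's actual connector: the `Event` type, `consume_union` function, and region names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    key: str        # natural idempotency key, e.g. an order_id
    region: str     # the regional primary that produced the event
    payload: str

def consume_union(sources, seen=None):
    """Merge events from all regional primaries, dropping duplicates
    by idempotency key. `sources` maps region name -> list of events."""
    seen = set() if seen is None else seen
    out = []
    for region, events in sources.items():
        for ev in events:
            if ev.key in seen:
                continue  # duplicate delivered via overlapping replication
            seen.add(ev.key)
            out.append(ev)
    return out

# Two primaries both carry order-42 (replication overlap); it is emitted once.
sources = {
    "us-east": [Event("order-42", "us-east", "created"),
                Event("order-43", "us-east", "created")],
    "eu-west": [Event("order-42", "eu-west", "created")],
}
merged = consume_union(sources)
assert [e.key for e in merged] == ["order-42", "order-43"]
```

In a real deployment the `seen` set would be a bounded, persistent store (or pushed into the sink as a unique constraint), since an unbounded in-memory set does not survive restarts.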
💡 Key Takeaways
Active-passive adds 50-200 ms of wide-area-network (WAN) latency to replication lag: consumers in the standby region read stale data. Failover requires promoting the standby and redirecting producers, typically taking 5-30 minutes with manual runbooks or 1-5 minutes with automated orchestration.
Active-active union-of-sources consumption requires consumers to handle duplicates: use natural idempotency keys (order_id, transaction_id) and deduplicate at the sink. Monitor per-source watermarks to detect when one primary falls behind and contributes incomplete data.
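Per-source watermark monitoring can be sketched as a comparison of each primary's event-time frontier against the freshest one. A hypothetical check (the function name, threshold, and region names are illustrative):

```python
import time

def find_lagging_primaries(watermarks, now=None, max_lag_s=30.0):
    """Flag primaries whose latest event-time watermark trails the
    freshest primary by more than max_lag_s seconds, indicating that
    their contribution to the union is currently incomplete.
    `watermarks` maps region -> epoch seconds of the last emitted event."""
    now = time.time() if now is None else now
    frontier = max(watermarks.values())
    return sorted(region for region, wm in watermarks.items()
                  if frontier - wm > max_lag_s)

# eu-west is 120 s behind the frontier: its data is incomplete.
wm = {"us-east": 1_000_000.0, "eu-west": 999_880.0, "ap-south": 999_995.0}
assert find_lagging_primaries(wm, now=1_000_000.0) == ["eu-west"]
```

A downstream aggregation job would hold back (or mark as provisional) any window whose end exceeds the minimum watermark across primaries.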
Conflict resolution in active-active is domain-specific: last-write-wins with vector clocks works for eventually consistent state; commutative operations (like counters) work for aggregations; and some domains require manual reconciliation (e.g., when two regions process the same payment).
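The last-write-wins-with-vector-clocks strategy can be sketched as follows: if one replica's vector clock causally dominates the other's, take it; only for genuinely concurrent writes (the split-brain case) fall back to a wall-clock tie-break. The record layout and function names here are illustrative assumptions, not a specific library's API.

```python
def dominates(a, b):
    """True if vector clock a >= b component-wise and a != b,
    i.e. a causally supersedes b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys) and a != b

def resolve(rec_a, rec_b):
    """Pick a winner between two replicas of the same key.
    Each record is (vector_clock, wall_clock_ts, value)."""
    vc_a, ts_a, _ = rec_a
    vc_b, ts_b, _ = rec_b
    if dominates(vc_a, vc_b):
        return rec_a              # a has seen b's write: no conflict
    if dominates(vc_b, vc_a):
        return rec_b
    # Concurrent writes (split brain): last-write-wins tie-break.
    return rec_a if ts_a >= ts_b else rec_b

# Causal case: the second US write supersedes the first.
newer = ({"us": 2}, 100.0, "addr-v2")
older = ({"us": 1}, 90.0, "addr-v1")
assert resolve(newer, older)[2] == "addr-v2"

# Concurrent case: neither clock dominates, so the later timestamp wins.
concurrent_us = ({"us": 1}, 100.0, "us-write")
concurrent_eu = ({"eu": 1}, 105.0, "eu-write")
assert resolve(concurrent_us, concurrent_eu)[2] == "eu-write"
```

Note the tie-break silently discards one concurrent write, which is exactly why correctness-critical domains prefer domain-specific merges or manual reconciliation instead.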
Cross-region replication saturates bandwidth quickly at scale: 1,000 events per second at 10 KB each is 10 MB/s, or 80 Mbps, per topic. With 100 topics and bidirectional replication (factor 2), you need 16 Gbps of sustained WAN capacity. Cost and throttling are real constraints.
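The bandwidth arithmetic above, written out (decimal units, i.e. 10 KB = 10,000 bytes):

```python
events_per_s = 1_000
event_bytes = 10 * 1_000        # 10 KB per event
topics = 100
directions = 2                  # bidirectional replication, factor 2

per_topic_mbps = events_per_s * event_bytes * 8 / 1e6   # bytes/s -> megabits/s
total_gbps = per_topic_mbps * topics * directions / 1e3

assert per_topic_mbps == 80.0   # 10 MB/s == 80 Mbps per topic
assert total_gbps == 16.0       # sustained WAN capacity needed
```

Real provisioning would add headroom for replication catch-up after an outage, when the lagging link must run well above the steady-state rate.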
Uber built uReplicator for tens of thousands of topics: per-topic throttling, offset awareness for exactly-once regional replay, and failure isolation prevent a single hot topic from saturating cross-datacenter links. LinkedIn uses mirroring with automated lag monitoring and circuit breakers that pause replication under backlog, preventing cascade failures.
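Per-topic throttling of the kind described above is commonly built from a token bucket per topic, so a hot topic exhausts only its own budget. This is a generic sketch of that mechanism, not uReplicator's actual implementation; all names and numbers are illustrative.

```python
class TopicThrottle:
    """Token-bucket rate limiter for one topic's replication traffic.
    A hot topic drains only its own bucket and cannot starve the link
    capacity reserved for other topics."""

    def __init__(self, bytes_per_s, burst_bytes):
        self.rate = bytes_per_s      # steady-state refill rate
        self.capacity = burst_bytes  # maximum burst size
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, nbytes, now):
        # Refill tokens for elapsed time, capped at bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # defer this batch; retry after the bucket refills

hot = TopicThrottle(bytes_per_s=1_000, burst_bytes=2_000)
assert hot.allow(2_000, now=0.0) is True    # initial burst fits
assert hot.allow(500, now=0.0) is False     # bucket empty: throttled
assert hot.allow(500, now=1.0) is True      # 1 s refill = 1,000 tokens
```

A circuit breaker as in the LinkedIn example is the complementary control: instead of smoothing a topic's rate, it pauses replication entirely when backlog crosses a threshold.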
📌 Examples
A financial services firm uses active-passive for payment topics. Writes go to the US primary; Europe reads with 100-150 ms of staleness. During a US region outage, operators manually promote Europe to primary within 10 minutes, accepting a brief write outage to preserve exactly-once semantics.
OpenAI runs multiple primary Kafka clusters for high availability. Stream processors consume a union of all primaries, dynamically adding new sources as the topology changes. They explicitly accept rare duplicates and gaps, relying on idempotent sinks (upserts with unique constraints) to deduplicate.
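An idempotent sink of the kind this example describes can be sketched with a unique constraint plus an upsert: replayed or duplicated events collapse into a single row. SQLite stands in for the real sink here, and the table and event names are hypothetical.

```python
import sqlite3

# In-memory stand-in for an idempotent sink: the PRIMARY KEY on event_id
# is the unique constraint that makes duplicate deliveries harmless.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    event_id TEXT PRIMARY KEY,
    payload  TEXT NOT NULL
)""")

def write(event_id, payload):
    # Upsert: a duplicate from the union of primaries overwrites in place
    # instead of creating a second row or raising a constraint error.
    db.execute(
        "INSERT INTO events (event_id, payload) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET payload = excluded.payload",
        (event_id, payload),
    )

# The same event arrives from two primaries; the sink stores it once.
write("evt-1", "signup")
write("evt-1", "signup")
write("evt-2", "login")
assert db.execute("SELECT COUNT(*) FROM events").fetchone()[0] == 2
```

The same pattern applies to any sink that supports upserts (e.g., `INSERT ... ON CONFLICT` in PostgreSQL), which is what makes at-least-once delivery from the union tolerable.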
Netflix mirrors personalization events across regions with active-passive replication. The primary region (US West) handles all writes; other regions replicate for local read latency. During regional failures, they accept a degraded user experience (stale recommendations) rather than failing over and risking loss of global ordering.