
Production Architecture and Integration Patterns

In production systems at companies like Google, Uber, and Coinbase, distributed SQL sits at the core of a larger architecture. Stateless application servers in multiple regions connect to a multi-region distributed SQL cluster. For example, a global fintech application might have API gateways in North America, Europe, and Asia, each forwarding to regional application clusters. Those clusters talk to a CockroachDB or Spanner deployment with data partitioned by customer or account ID and replicated across regions.

Caching layers work closely with distributed SQL but require careful invalidation strategies. Redis or in-memory application caches absorb hot read traffic, serving requests in under 1 millisecond compared to 5 to 10 milliseconds for database reads. However, cache invalidation introduces complexity. After a write commits to the database, the cache may remain stale for seconds or minutes depending on Time To Live (TTL) settings. Applications must decide whether to tolerate stale reads, implement active invalidation on writes, or use cache-aside patterns where misses populate the cache.

Data pipelines stream changes from distributed SQL into analytics warehouses and real-time processing systems. Change Data Capture (CDC) mechanisms export transaction logs to Kafka or similar message buses. From there, stream processors like Flink or Spark transform and aggregate data, loading results into columnar stores for business intelligence queries. This decouples analytical workloads from transactional systems, preventing heavy reporting queries from impacting user-facing latencies.

Orchestration and observability are critical operational components. Kubernetes or similar platforms manage database pods, handling node failures by rescheduling replicas and maintaining desired replica counts. Monitoring systems track consensus health metrics like leader lease duration, committed log lag measured in entries or bytes, transaction restart rates indicating contention, and lock wait times showing hotspots. When these metrics degrade, automated runbooks relocate replicas, rebalance ranges, or alert operators. At scale, teams monitor thousands of ranges across hundreds of nodes, requiring sophisticated aggregation and alerting to detect issues before they impact users.
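The partition-by-customer, replicate-across-regions layout described above can be sketched with CockroachDB's multi-region SQL. The database, table, and region names here are hypothetical, and Spanner expresses the same idea through different mechanisms (instance configurations and schema design):

```python
import psycopg2  # CockroachDB speaks the PostgreSQL wire protocol

conn = psycopg2.connect("postgresql://root@localhost:26257/bank")
conn.autocommit = True  # run each DDL statement as its own transaction

with conn.cursor() as cur:
    # Declare the regions the database spans; the primary region holds
    # leaseholders for data that is not otherwise homed elsewhere.
    cur.execute('ALTER DATABASE bank SET PRIMARY REGION "us-east1"')
    cur.execute('ALTER DATABASE bank ADD REGION "europe-west1"')
    cur.execute('ALTER DATABASE bank ADD REGION "asia-southeast1"')

    # REGIONAL BY ROW homes each row (here, each account) in the region
    # named by its hidden crdb_region column, so a customer's data lives
    # near that customer while remaining replicated for fault tolerance.
    cur.execute("ALTER TABLE accounts SET LOCALITY REGIONAL BY ROW")
```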
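To make the cache-aside and active-invalidation choices concrete, here is a minimal sketch of the read and write paths, assuming a Redis instance and a PostgreSQL-compatible database endpoint; the schema, connection strings, and TTL are illustrative:

```python
import redis
import psycopg2

r = redis.Redis(host="localhost", port=6379)
conn = psycopg2.connect("postgresql://app@localhost:26257/bank")

TTL_SECONDS = 30  # bounds staleness even if an invalidation is missed

def get_balance(account_id: str) -> float:
    # Cache aside: try the cache first, fall back to the database on a miss.
    key = f"balance:{account_id}"
    cached = r.get(key)
    if cached is not None:
        return float(cached)
    with conn.cursor() as cur:
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (account_id,))
        balance = cur.fetchone()[0]
    r.set(key, str(balance), ex=TTL_SECONDS)  # miss populates the cache
    return float(balance)

def update_balance(account_id: str, delta: float) -> None:
    # Active invalidation: drop the cache entry only after the commit
    # succeeds, so a failed transaction never evicts a valid entry.
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (delta, account_id),
        )
    conn.commit()
    r.delete(f"balance:{account_id}")
```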
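The CDC hand-off into Kafka can be as small as one statement. The sketch below uses CockroachDB's documented changefeed syntax (changefeeds also require rangefeeds to be enabled on the cluster); the broker address and table name are placeholders, and Spanner's equivalent is change streams:

```python
import psycopg2

conn = psycopg2.connect("postgresql://root@localhost:26257/bank")
conn.autocommit = True

with conn.cursor() as cur:
    # Publish every change to the accounts table to Kafka. `updated` adds
    # commit timestamps to each message; `resolved` periodically emits
    # watermarks so downstream consumers (e.g. Flink) know all events up
    # to a timestamp have arrived and can close aggregation windows.
    cur.execute("""
        CREATE CHANGEFEED FOR TABLE accounts
        INTO 'kafka://kafka:9092'
        WITH updated, resolved = '10s'
    """)
```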
💡 Key Takeaways
Multi-region architecture: Regional API gateways forward to application clusters, which connect to distributed SQL with data partitioned by customer ID and geo-replicated
Caching absorbs hot reads (under 1ms from Redis vs 5 to 10ms from database) but requires invalidation strategies to handle stale data after writes
Change Data Capture streams transaction logs to message buses like Kafka, enabling real-time analytics and decoupling OLAP from OLTP workloads
Kubernetes orchestration manages database pods and handles failures by rescheduling replicas, maintaining replica counts across zones and regions
Observability tracks consensus health (leader lease, log lag), contention (transaction restarts, lock waits), and enables automated rebalancing before user impact
📌 Examples
Coinbase architecture: Regional app clusters in US and EU connect to CockroachDB with account data partitioned by user ID, replicated across 3 zones per region
Cache invalidation pattern: After an account balance update commits in 50ms, the cache entry is invalidated; the next read takes 5ms from a replica and repopulates the cache for subsequent sub-1ms hits
CDC pipeline: CockroachDB changefeed streams transaction events to Kafka, Flink aggregates balances for fraud detection, results stored in ClickHouse for dashboards
Monitoring metrics: Leader lease duration drops from 9s to 3s, indicating frequent leader changes; automated rebalancing moves hot ranges to less loaded nodes (see the alerting sketch below)
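As a sketch of the automated-runbook idea, the loop below polls a Prometheus endpoint for the contention and consensus signals named above. The endpoint, metric names, and thresholds are assumptions for illustration, not the database's actual metric catalog:

```python
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # hypothetical endpoint

# Hypothetical PromQL checks; real metric names depend on the database
# version and how its exporter labels them.
CHECKS = {
    # Contention: sustained transaction restarts per second.
    "rate(txn_restarts_total[5m])": 10.0,
    # Consensus health: follower log lag in entries.
    "max(raft_log_follower_behind)": 1000.0,
    # Hotspots: p99 lock wait time in seconds.
    "histogram_quantile(0.99, rate(lock_wait_seconds_bucket[5m]))": 0.5,
}

def evaluate_checks() -> list[str]:
    """Return a human-readable alert line for every breached threshold."""
    alerts = []
    for query, threshold in CHECKS.items():
        result = requests.get(PROM_URL, params={"query": query}, timeout=5).json()
        for sample in result.get("data", {}).get("result", []):
            value = float(sample["value"][1])
            if value > threshold:
                alerts.append(f"{query} = {value:.2f} exceeds {threshold}")
    return alerts

if __name__ == "__main__":
    while True:
        for alert in evaluate_checks():
            # In production: page an operator or trigger a rebalance runbook.
            print("ALERT:", alert)
        time.sleep(60)
```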