
Distributed Architecture: Sharding, Replication, and Query Execution

Cluster Topology: Shards and Replicas

A production ClickHouse deployment uses both sharding for horizontal scale and replication for high availability. Sharding splits the full dataset across multiple independent nodes (shards), where each shard holds a disjoint subset of the data. Replication maintains multiple copies of each shard's data on different physical machines. A common pattern is N shards with 2 to 3 replicas per shard. For example, a 12-node cluster might be configured as 6 shards × 2 replicas: each shard handles 1/6th of the total data, and if one replica fails, queries automatically route to the other replica of that shard. This gives you both throughput scaling (6 shards can ingest and query in parallel) and fault tolerance.

Data is distributed across shards using a sharding key, typically a hash of tenant ID, customer ID, or another high-cardinality dimension. For time-series data, you might shard by a combination of entity hash and time range to balance data distribution. Replication uses ZooKeeper or ClickHouse Keeper to coordinate which parts exist on each replica and to orchestrate background fetches of missing parts.
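To make the topology concrete, here is a minimal sketch of the common two-table pattern, assuming a hypothetical cluster named analytics_6s2r (6 shards × 2 replicas) defined in the server's remote_servers configuration, per-node {shard} and {replica} macros, and an illustrative requests_local events table sharded by a hash of customer_id:

```sql
-- Sketch only: local replicated table plus a Distributed "umbrella" table.
-- Assumes a cluster named analytics_6s2r (6 shards x 2 replicas) is already
-- defined under <remote_servers> and {shard}/{replica} macros are set per node.

-- Local table, created on every node of the cluster.
CREATE TABLE requests_local ON CLUSTER analytics_6s2r
(
    event_time  DateTime,
    customer_id UInt64,
    url         String,
    status      UInt16
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/requests_local', '{replica}')
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (customer_id, event_time);

-- Distributed table: routes each inserted row to a shard by hashing the
-- customer ID, and fans reads out to one replica per shard.
CREATE TABLE requests ON CLUSTER analytics_6s2r AS requests_local
ENGINE = Distributed(analytics_6s2r, currentDatabase(), requests_local, cityHash64(customer_id));
```

Writes and reads go through the Distributed table requests: the sharding expression decides which shard receives each row, while ReplicatedMergeTree keeps the two replicas of each shard in sync in the background.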
Cluster Configuration Example: 6 shards × 2 replicas per shard = 12 nodes total
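As a sanity check, the topology a node sees can be inspected from system.clusters. The cluster name below is the assumed one from the sketch above; on a 6-shard × 2-replica cluster the query returns 12 rows:

```sql
-- Sketch only: list every shard/replica pair that this node knows about
-- for the assumed cluster name.
SELECT
    cluster,
    shard_num,
    replica_num,
    host_name
FROM system.clusters
WHERE cluster = 'analytics_6s2r'
ORDER BY shard_num, replica_num;
```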
Distributed Query Execution

When a client sends a query to a ClickHouse cluster, it hits a coordinator node (often any node can act as coordinator). The coordinator parses the query, determines which shards need to participate, and sends subqueries to one replica from each relevant shard. Each shard processes its local data: it applies filters, aggregates, and streams partial results back to the coordinator. For a query like "top 100 URLs by request count in the last hour," each shard returns its local top 100. The coordinator then merges these partial results, sorts globally, and returns the final top 100.

This scatter-gather pattern is efficient when queries are selective (partitioning and primary key filters eliminate most data) and when aggregations can be partially computed. For example, SUM and COUNT aggregate incrementally, so combining results from 6 shards just means adding 6 numbers. More complex operations like percentile calculations require more coordination.

At scale, companies like Uber run clusters with 50 to 200 nodes. With well-designed partitioning, a query touching 24 hours of data might scan only 20% of shards (the ones holding relevant time partitions), and each shard scans only a subset of its data thanks to primary key filtering. This allows p50 query latency around 50 to 300 milliseconds and p99 under 2 seconds for interactive dashboards.

Replication and Consistency

Replication in ClickHouse is asynchronous. When a write completes on one replica, background processes replicate parts to the other replicas over the next seconds to minutes, so different replicas can be slightly out of sync: queries might see data on one replica that hasn't yet appeared on another. ZooKeeper or ClickHouse Keeper tracks metadata: which parts exist, replication queues, and table schemas. If a replica falls behind, it fetches missing parts from its peers. If a node fails during a query, the coordinator can retry that shard's portion on another replica. However, if the coordination service itself becomes unavailable or overloaded, replication and distributed operations can stall.
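The scatter-gather flow maps directly onto an ordinary aggregation against the Distributed table from the earlier sketch (table and column names are illustrative):

```sql
-- Sketch only: scatter-gather aggregation over the Distributed table.
-- Each shard computes partial counts for its local data; the coordinator
-- merges the partial aggregation states, sorts, and applies the final LIMIT.
SELECT
    url,
    count() AS request_count
FROM requests
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY url
ORDER BY request_count DESC
LIMIT 100;
```

Because count() merges across shards by simple addition, the coordinator's share of the work stays small; heavier aggregates such as quantile() ship larger intermediate states from each shard before the final merge.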
💡 Key Takeaways
Sharding splits data across nodes by hashing a key (customer ID, tenant), allowing parallel ingestion and query execution across all shards
Typical topology: N shards with 2 to 3 replicas per shard, balancing throughput scaling and fault tolerance
Distributed queries use scatter-gather: the coordinator sends subqueries to each shard, shards return partial results, and the coordinator merges and finalizes
Replication is asynchronous and coordinated by ZooKeeper or ClickHouse Keeper, meaning replicas can be slightly out of sync (eventual consistency model); see the monitoring sketch after this list
At companies like Uber and Cloudflare, clusters of 50 to 200 nodes sustain ingestion of millions of events per second with p50 query latency of 50 to 300 ms
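Since replication is asynchronous, lag is worth monitoring. A minimal sketch, assuming the ReplicatedMergeTree tables from the earlier example, checks each local replica's delay and queue via system.replicas (the thresholds here are arbitrary):

```sql
-- Sketch only: surface replicated tables on this node that are lagging.
-- absolute_delay is the estimated lag in seconds; queue_size counts parts
-- still waiting to be fetched or merged from peers.
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    active_replicas,
    total_replicas
FROM system.replicas
WHERE absolute_delay > 60 OR queue_size > 100;
```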
📌 Examples
1. A 12-node cluster configured as 6 shards × 2 replicas can ingest 6 million rows per second (1 million per shard) and query in parallel across all shards
2. Query for top 100 endpoints by traffic: each of 6 shards returns its local top 100, and the coordinator merges them into the final global top 100
3. If one replica fails, queries automatically route to the surviving replica of that shard, maintaining availability