Failure Modes and Production Challenges
Merge Storms and Part Proliferation
The most common production failure mode in ClickHouse is merge backlog, where the system accumulates too many small parts faster than background merges can consolidate them. This happens when engineers insert tiny batches (hundreds or even individual rows) at high frequency, or when they create excessive partitions.
The symptom: query latency suddenly spikes from 200 milliseconds to 10+ seconds. Each query must open and scan hundreds or thousands of small parts instead of a few large merged ones, so the CPU spends more time on file I/O and metadata operations than on actual data processing.
Merge Crisis Timeline: NORMAL (100 parts) → OVERLOAD (2,000+ parts) → CRISIS (queries fail)
The fix: Batch inserts into blocks of 10,000 to 50,000 rows. Limit partitions to practical ranges (daily or weekly for most workloads, not hourly or per customer). Monitor the system.parts table and alert when parts per partition exceed 300. Tune merge settings like max_bytes_to_merge_at_max_space_usage to be more aggressive during off-peak hours.
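A minimal sketch of that system.parts check, assuming a hypothetical table named events and the 300-parts-per-partition alert threshold suggested above:

-- Count active parts per partition and surface partitions past the alert threshold
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active AND table = 'events'
GROUP BY database, table, partition
HAVING active_parts > 300
ORDER BY active_parts DESC;

Running this on a schedule (or wiring it into an alerting system) catches part proliferation before it turns into a merge storm.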
Hot Partition and Cardinality Explosion
Another failure mode occurs when queries with high-cardinality GROUP BY clauses create massive intermediate aggregation states that exhaust memory. For example, grouping by user_id over 10 billion rows with 500 million unique users means holding 500 million partial aggregation states in memory.
If the GROUP BY key isn't aligned with the primary key sort order, ClickHouse can't stream and flush partial results; it must buffer everything until the full scan completes. On a node with 128 GB of RAM, this can cause out-of-memory errors or trigger disk spilling, which degrades performance by 10x or more.
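One guardrail, sketched below with illustrative byte values rather than recommendations, is to let aggregation state spill to disk before the per-query memory cap is hit (max_bytes_before_external_group_by and max_memory_usage are standard ClickHouse settings; the events table is hypothetical):

-- Spill GROUP BY state to disk past ~20 GB instead of buffering everything in RAM,
-- and hard-cap the query at ~50 GB so a single query cannot exhaust the node
SELECT user_id, count() AS event_count
FROM events
GROUP BY user_id
SETTINGS max_bytes_before_external_group_by = 20000000000,
         max_memory_usage = 50000000000;

Spilling still degrades performance, so treat it as a safety net rather than a substitute for aligning the primary key with the query pattern.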
Mitigation: Design primary keys to match common query patterns. If you frequently group by customer_id, include it early in the primary key tuple. Use pre-aggregation tables (materialized views that incrementally roll up data) for known high-cardinality dimensions. Limit query scope with tight time filters to reduce the working set size.
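A minimal pre-aggregation sketch along those lines; the events table, its columns, and the daily grain are all hypothetical:

-- Incrementally roll raw events up into daily per-customer counts as data is inserted
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree()
ORDER BY (customer_id, day)
AS SELECT
    customer_id,
    toDate(event_time) AS day,
    count() AS event_count
FROM events
GROUP BY customer_id, day;

-- Read with sum(event_count): merges combine partial counts lazily
SELECT customer_id, day, sum(event_count)
FROM events_daily_mv
GROUP BY customer_id, day;

Dashboards that only need daily counts then scan the small rollup instead of the 10-billion-row base table.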
Replication Lag and Split Brain
Because replication is asynchronous, network partitions or slow replicas can lead to significant lag. One replica might be minutes behind during heavy ingestion. Queries routed to the lagging replica see stale data, confusing users or breaking dashboards that assume recent data is visible.
Worse, if ZooKeeper or ClickHouse Keeper becomes unavailable or experiences network partitions, replication queues freeze. New writes may succeed on one replica but not propagate to others. Manual intervention is required to reconcile parts, and in rare cases, data can be duplicated or lost if parts are incorrectly fetched or dropped.
❗ Remember: Always monitor replication lag (system.replication_queue) and ZooKeeper health. Set up alerts for lag over 60 seconds or queue depth over 1000 entries.
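A sketch of those checks; the thresholds mirror the 60-second and 1000-entry alert levels above:

-- Replicas lagging more than 60 seconds behind
SELECT database, table, replica_name, absolute_delay
FROM system.replicas
WHERE absolute_delay > 60;

-- Replicated tables whose replication queue has backed up past 1000 entries
SELECT database, table, count() AS queue_depth
FROM system.replication_queue
GROUP BY database, table
HAVING queue_depth > 1000;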
Schema Evolution Pain Points
Altering schemas in ClickHouse is tricky at scale. Adding a column is fast (metadata-only for nullable columns with defaults), but changing the primary key or partition key requires rebuilding the entire table. For tables holding terabytes of data, that means hours of downtime or a complex blue-green migration strategy.
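For a sense of that asymmetry, a sketch with hypothetical table and column names:

-- Cheap: adding a column with a default is a metadata-only change; existing parts are untouched
ALTER TABLE events ADD COLUMN plan_tier LowCardinality(String) DEFAULT 'free';

-- Expensive: there is no cheap in-place path to a different partition key or sort order;
-- the usual approach is to create a new table and backfill it, then cut traffic over:
--   CREATE TABLE events_v2 (...) ENGINE = MergeTree
--     PARTITION BY toYYYYMM(event_time) ORDER BY (customer_id, event_time);
--   INSERT INTO events_v2 SELECT * FROM events;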
Deletes and updates are also problematic. They're implemented as mutations that rewrite parts. A DELETE WHERE user_id = 'x' touching 1% of rows across thousands of parts can take hours to complete and cause sustained CPU and I/O load. Heavy mutation workloads can cascade into merge backlog, creating a vicious cycle.
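A sketch of that kind of mutation and how to watch it (the table name and filter are placeholders):

-- Issued as an asynchronous mutation that rewrites every part containing matching rows
ALTER TABLE events DELETE WHERE user_id = 'x';

-- Track outstanding work; parts_to_do counts parts still waiting to be rewritten
SELECT table, command, parts_to_do, is_done
FROM system.mutations
WHERE NOT is_done;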
When ClickHouse Breaks Down:
ClickHouse is the wrong choice when you need single-row, low-latency lookups (use DynamoDB or Redis). It's wrong when you need strong consistency and multi-row transactions (use Postgres or Spanner). It's wrong when the schema changes frequently and unpredictably (use document stores like MongoDB). And it's wrong when your team lacks the operational experience to tune merges, replication, and query patterns.
💡 Key Takeaways
✓ Merge storms occur when inserts create parts faster than background merges consolidate them, causing query latency to spike from 200ms to 10+ seconds
✓ High-cardinality GROUP BY on dimensions not in the primary key can exhaust memory with millions of aggregation states; align the primary key with query patterns
✓ Asynchronous replication means replicas can lag by seconds to minutes; queries on lagging replicas see stale data, confusing users
✓ Schema changes like altering the primary key require full table rebuilds; for terabyte-scale tables, this means hours of migration effort
✓ Heavy update or delete workloads, implemented as mutations, can cause sustained merge pressure and degrade ingestion throughput by 50% or more
📌 Examples
1. Production incident: an engineer inserts 1 row per API call instead of batching, creating 10,000 parts per hour; query latency jumps from 300ms to 30 seconds
2. A query with GROUP BY user_id over 10 billion rows and 500 million unique users exhausts 128 GB of RAM and triggers an out-of-memory error
3. A network partition causes replica lag of 5 minutes; dashboard queries show inconsistent data depending on which replica answers