MergeTree Storage Engine: How Write and Read Paths Work
The Write Path: Immutable Parts and Background Merges
When data arrives in ClickHouse, it's not written directly to a single file. Instead, each batch of rows becomes an immutable part. Think of a part as a sorted, compressed set of column files on disk. If you insert 100,000 rows, that becomes one part. Insert another 50,000 rows, that's a second part.
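As a minimal sketch (the table name, column types, and values here are illustrative, not from the lesson), a MergeTree table and two batch inserts might look like this, with each INSERT landing as its own immutable part on disk:

-- Hypothetical table; names and types are assumptions for illustration.
CREATE TABLE events
(
    timestamp   DateTime,
    customer_id UInt64,
    event_type  String
)
ENGINE = MergeTree
ORDER BY (timestamp, customer_id, event_type);

-- Each batch INSERT creates one immutable, sorted, compressed part.
INSERT INTO events VALUES (now(), 42, 'page_view'), (now(), 7, 'click');  -- part 1
INSERT INTO events VALUES (now(), 42, 'purchase');                        -- part 2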
Each part is sorted by the table's primary key. For example, if your primary key is (timestamp, customer_id, event_type), all rows within that part are physically ordered by timestamp first, then customer ID, then event type. This sorting is crucial for query performance.
The challenge is that accumulating thousands of small parts would make queries slow, since each query would need to open and scan thousands of files. ClickHouse solves this through background merge processes. These continuously combine small parts into larger ones, maintaining the sort order and consolidating data. A well-tuned system keeps parts per partition under 100 to 300, even under sustained write load.
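One way to watch this in action, assuming the hypothetical events table above, is to inspect ClickHouse's system.parts table and force a merge with OPTIMIZE:

-- List the parts behind the table; 'active' parts are the ones queries read.
SELECT partition, name, rows, active
FROM system.parts
WHERE database = currentDatabase() AND table = 'events';

-- Merges normally run in the background; OPTIMIZE ... FINAL forces one for inspection.
OPTIMIZE TABLE events FINAL;

After the merge, the earlier small parts are marked inactive and are eventually removed.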
Typical Ingestion Pipeline
BUFFER (1000s of rows) → INSERT BATCH (1 part) → BACKGROUND MERGE
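When clients cannot batch on their side, one option is to let the server do the buffering. The sketch below uses ClickHouse's async_insert settings against the hypothetical events table; the values are illustrative:

-- Server-side buffering: many small INSERTs are collected into one part.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 99, 'click');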
The Read Path: Sparse Indexes and Vectorized Execution
ClickHouse doesn't build a dense index like traditional databases where every row has an index entry. Instead, it creates a sparse primary index with one entry per 8,192 rows (the granule size). When a query filters by WHERE customer_id = 5, ClickHouse uses this sparse index to identify which granules might contain that customer ID, then reads only those granules from disk.
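A quick way to see how many granules a filter touches is ClickHouse's EXPLAIN with the indexes option (shown here against the hypothetical events table):

-- Shows how many granules the primary index selects versus the total.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE customer_id = 5;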
Once data is read into memory, the vectorized execution engine processes it in blocks of 4,000 to 64,000 values per column. Each operation like filtering, aggregating, or joining processes entire blocks at once, not row by row. This keeps CPU pipelines full and improves cache locality.
For a query like "sum of revenue by country over 10 billion rows," ClickHouse reads only the country and revenue columns, skips granules where the sparse index proves no matching rows exist, decompresses blocks in parallel, and aggregates using vectorized instructions. On modern hardware with NVMe SSDs, this can scan several billion rows per second.
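Expressed as SQL, that query might look like the following sketch, assuming a wider events table that also carries country and revenue columns (neither is defined in the lesson):

-- Reads only the country and revenue column files; other columns are never touched.
SELECT country, sum(revenue) AS total_revenue
FROM events
GROUP BY country
ORDER BY total_revenue DESC;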
✓ In Practice: At Cloudflare, ClickHouse ingests millions of HTTP request logs per second, writing batches of 10,000 to 50,000 rows per insert. Background merges keep active partitions at 100 to 200 parts, allowing query latency to stay under 500ms at p99.
💡 Key Takeaways
✓ Each insert creates an immutable part sorted by primary key, allowing parallel writes without locking
✓ Background merges consolidate small parts into larger ones, keeping the number of parts manageable (typically 100 to 300 per partition)
✓ Sparse primary index stores one entry per 8,192 rows, allowing queries to skip entire ranges of data without reading from disk
✓ Vectorized execution processes blocks of thousands of values at once, achieving scan rates of several billion rows per second on modern hardware
✓ Compression codecs (delta, dictionary, Gorilla) typically achieve 5x to 15x compression ratios, so 10 TB of SSD can store 50 to 150 TB of raw data
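The codec takeaway can be made concrete with a hedged DDL sketch; CODEC(...) and LowCardinality are real ClickHouse features, but the table and column choices here are illustrative:

-- Per-column codecs: Delta for monotonic timestamps, Gorilla for floats,
-- LowCardinality for dictionary-encoded, repetitive strings.
CREATE TABLE events_compressed
(
    timestamp DateTime CODEC(Delta, ZSTD),
    revenue   Float64  CODEC(Gorilla),
    country   LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY timestamp;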
📌 Examples
1. Batch insert of 50,000 events becomes one part with column files sorted by (timestamp, customer_id)
2. Query with WHERE customer_id = 'acme_corp' uses the sparse index to identify relevant granules, reading perhaps 0.1% of total data
3. A single node can scan 5 billion rows per second in memory during aggregation queries with vectorized execution