
Schema Design and Query Optimization Patterns

Primary Key Selection: The Most Critical Decision

Your primary key determines both storage layout and query performance. ClickHouse physically sorts data on disk by the primary key tuple, so queries whose filters match a prefix of the primary key can skip massive amounts of data using the sparse index, while queries on other columns must scan everything.

Consider a logs table for a multi-tenant SaaS, with two schema options:

- Option A: `PRIMARY KEY (timestamp, tenant_id, endpoint)`
- Option B: `PRIMARY KEY (tenant_id, timestamp, endpoint)`

Option A is optimized for time-range queries across all tenants, like "all 500 errors in the last hour": the sparse index can jump directly to the time range. Option B is optimized for per-tenant queries, like "show me tenant X's traffic for the last day": the sparse index narrows to tenant X's data first, then scans within that tenant's time range. For multi-tenant systems where 90% of queries filter by tenant, Option B can be 100x faster because it reads roughly 1/Nth of the data (where N is the number of tenants).
"Primary key order is a commitment to a query pattern. Choose wrong, and you'll scan terabytes for queries that should read gigabytes."
Partitioning Strategy

Partitioning splits tables into independent chunks, each stored as a separate directory. Most teams partition by time (day, week, or month) using a PARTITION BY toYYYYMMDD(timestamp) clause. This enables fast partition pruning: queries with time filters only touch the relevant partitions.

Over-partitioning is a common mistake. If you partition by hour and by customer, and you have 1,000 customers, that's 24,000 partitions per day. Each partition has its own set of parts, multiplying merge overhead, and queries that span partitions must coordinate across thousands of directories.

Rule of thumb: keep active partitions (those receiving writes) under 100 to 200 total. For most workloads, daily partitions are sufficient; use weekly or monthly partitions if write volume is moderate. Only partition by additional dimensions (like customer ID) if you regularly drop entire partitions for data lifecycle, as in "delete all data for customer X."
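A minimal sketch of daily partitioning, assuming a hypothetical request_logs table (names are illustrative, not from the source):

```sql
CREATE TABLE request_logs
(
    timestamp DateTime,
    tenant_id UInt32,
    endpoint  LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)   -- one partition per day
ORDER BY (tenant_id, timestamp);

-- An explicit time filter lets the query touch only the matching partitions:
SELECT count()
FROM request_logs
WHERE timestamp >= toDateTime('2024-06-01 00:00:00')
  AND timestamp <  toDateTime('2024-06-02 00:00:00');

-- Dropping a whole partition is a cheap metadata operation,
-- which is what makes partition-based data lifecycle attractive:
ALTER TABLE request_logs DROP PARTITION 20240601;
```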
Compression and Codecs

ClickHouse allows per-column compression codecs, and choosing the right codec can reduce storage by 3x to 10x beyond the default compression. For timestamp columns, use the Delta or DoubleDelta codecs: they store differences between consecutive values, which compress extremely well for monotonic time series. For low-cardinality strings (status codes, country codes), use the LowCardinality(String) wrapper, which applies dictionary encoding. For numeric metrics with small deltas, the Gorilla codec works well.

| Column Type     | Codec          | Compression Ratio |
|-----------------|----------------|-------------------|
| Timestamp       | Delta          | 10x to 20x        |
| Status code     | LowCardinality | 5x to 15x         |
| Latency (float) | Gorilla        | 3x to 8x          |
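In DDL, codecs are declared per column. A short sketch on a hypothetical metrics table (names are assumptions; DoubleDelta is used here, though plain Delta also fits timestamps):

```sql
CREATE TABLE http_metrics
(
    timestamp   DateTime CODEC(DoubleDelta, ZSTD),  -- monotonic time series
    status_code LowCardinality(String),             -- dictionary-encoded enum
    latency_ms  Float64 CODEC(Gorilla, ZSTD)        -- floats with small deltas
)
ENGINE = MergeTree
ORDER BY timestamp;
```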
Materialized Views for Pre-Aggregation

For queries with expensive aggregations over high-cardinality dimensions, materialized views can pre-compute results incrementally. Define a target table with coarser granularity, then create a materialized view that aggregates inserts on the fly.

Example: a raw table has billions of events with user_id and action, and a dashboard needs daily active users by action type. Create a materialized view that maintains a table of (date, action, COUNT(DISTINCT user_id)), updated on each insert. The dashboard then reads from this small aggregated table instead of scanning billions of raw rows (a DDL sketch appears after the query tips below).

The tradeoff is write amplification: each insert into the raw table also triggers an insert into the materialized view's target. With 5 materialized views, you write 6 times per row. This is acceptable when read QPS far exceeds write QPS.

Query Optimization Techniques

Always push filters down to the WHERE clause, especially on primary key and partition key columns. ClickHouse can prune partitions and skip granules only if filters are explicit in WHERE, not in HAVING or subqueries.

For joins, ClickHouse performs best with small dimension tables (under 10 million rows) that fit in memory. Use the JOIN type carefully: ALL INNER JOIN is fastest, while ANY LEFT JOIN deduplicates the right-hand side. Avoid large-to-large table joins; denormalize or use dictionaries instead.

Limit GROUP BY cardinality. If grouping by a high-cardinality key like user_id, add aggressive time filters to reduce the working set. Monitor system.query_log for queries with high memory usage (over 50% of node RAM) and refactor them.
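Here is the daily-active-users example sketched end to end, under assumed names (events, daily_active_users, and so on are hypothetical). It uses uniqState/uniqMerge, ClickHouse's approximate distinct count (uniqExactState would give exact counts at higher cost), and the final dashboard query also illustrates the filter-pushdown advice above:

```sql
-- Raw table: billions of rows.
CREATE TABLE events
(
    event_date Date,
    action     LowCardinality(String),
    user_id    UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (action, event_date);

-- Target table: holds partial aggregate states for the distinct count.
CREATE TABLE daily_active_users
(
    event_date Date,
    action     LowCardinality(String),
    users      AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_date, action);

-- Materialized view: folds every insert into the rollup incrementally.
CREATE MATERIALIZED VIEW daily_active_users_mv
TO daily_active_users
AS SELECT event_date, action, uniqState(user_id) AS users
FROM events
GROUP BY event_date, action;

-- Dashboard query: reads the small rollup; the time filter sits in WHERE
-- so partitions and granules can be pruned.
SELECT event_date, action, uniqMerge(users) AS dau
FROM daily_active_users
WHERE event_date >= today() - 30
GROUP BY event_date, action
ORDER BY event_date, action;
```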
💡 Key Takeaways
- Primary key order determines the physical sort on disk; queries filtering by a primary key prefix can skip 90%+ of the data via the sparse index
- Partition by time (daily or weekly) to enable partition pruning; avoid over-partitioning (keep active partitions under 100 to 200)
- Use specialized codecs per column: Delta for timestamps (10x to 20x compression), LowCardinality for enums, Gorilla for metrics
- Materialized views pre-aggregate data incrementally, trading write amplification (6x writes for 5 views) for fast dashboard queries
- Push filters into the WHERE clause on primary key and partition key columns; ClickHouse cannot prune data from HAVING or subquery filters
📌 Examples
1. Multi-tenant logs with `PRIMARY KEY (tenant_id, timestamp)` let per-tenant queries read only ~1/1000th of the total data (with 1,000 tenants)
2. Daily partitions with 100 million rows per day vs. hourly partitions with ~4 million rows per hour: daily is simpler and cuts merge overhead by 24x
3. A materialized view on the raw events table pre-computes daily active users, reducing dashboard query time from 5 seconds (scanning 10 billion rows) to 50 ms (scanning 1,000 pre-aggregated rows)