Columnar Storage Failure Modes and Edge Cases
Write Amplification and Compaction Backlog:
One of the most common production issues with columnar storage is compaction falling behind under high update or delete load. When the workload has frequent updates or deletes, the engine typically writes small delta segments or delete markers rather than rewriting large columnar files immediately. Background compaction processes then merge and rewrite these into consolidated columnar segments.
If compaction falls behind, query performance degrades substantially. The query engine must read many small segments and apply more delete filtering at read time. In cloud systems, this also increases object count and metadata overhead, which can itself become a bottleneck. A production Snowflake deployment might see query latency climb from 2 seconds to 15 seconds as compaction lags and the number of micro-partitions per table grows from 5,000 to 50,000.
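To build intuition for why fragmentation hurts reads, here is a minimal Python sketch that models per-query latency as a base scan cost plus a fixed per-segment overhead for fetching footers and applying delete filters. The constants are illustrative assumptions chosen to roughly match the numbers above, not measurements from Snowflake or any other engine.

```python
# Illustrative model of read-side overhead as compaction falls behind.
# All constants are assumptions for illustration, not engine measurements.

BASE_SCAN_MS = 500                   # cost of scanning the column data itself
PER_SEGMENT_OVERHEAD_MS = 0.25       # open segment, fetch footer, plan the read
DELETE_FILTER_MS_PER_SEGMENT = 0.05  # apply delete markers at read time

def estimated_query_ms(num_segments: int) -> float:
    """Estimated latency as a function of segment (micro-partition) count."""
    per_segment = PER_SEGMENT_OVERHEAD_MS + DELETE_FILTER_MS_PER_SEGMENT
    return BASE_SCAN_MS + num_segments * per_segment

for n in (5_000, 20_000, 50_000):
    print(f"{n:>6,} segments -> ~{estimated_query_ms(n) / 1000:.1f} s")
# 5,000 segments  -> ~2.0 s
# 20,000 segments -> ~6.5 s
# 50,000 segments -> ~15.5 s
```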
Wide Table Metadata Overhead:
Very wide tables with hundreds or thousands of columns stress metadata systems and cache capacity. Each query must load metadata for every column it might access, and systems typically cache column statistics in memory. A table with 5,000 columns and 100,000 row groups generates 500 million metadata entries.
Metadata lookups or statistics loading can become a hidden performance bottleneck, sometimes taking longer than the actual data scan. This manifests as high query planning time before execution even starts. Production teams address this by splitting extremely wide tables into multiple narrower tables or using nested column formats like Parquet structs to reduce top level column count.
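A quick back-of-envelope calculation makes the scale concrete. The sketch below assumes one statistics entry per column per row group and a hypothetical 64 bytes per entry; real engines store different amounts of metadata, but the multiplication is the point.

```python
# Back-of-envelope sizing for per-column, per-row-group statistics.
# The 64 bytes-per-entry figure is an assumption; real engines vary widely.

def metadata_entries(num_columns: int, num_row_groups: int) -> int:
    # One min/max/null-count record per (column, row group) pair.
    return num_columns * num_row_groups

def metadata_bytes(num_columns: int, num_row_groups: int, bytes_per_entry: int = 64) -> int:
    return metadata_entries(num_columns, num_row_groups) * bytes_per_entry

cols, groups = 5_000, 100_000
print(f"entries: {metadata_entries(cols, groups):,}")               # 500,000,000
print(f"approx size: {metadata_bytes(cols, groups) / 1e9:.0f} GB")  # 32 GB at 64 B/entry
```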
Corruption and Partial Write Handling:
From a reliability perspective, corruption or partial writes in a column chunk can make a single column unreadable for a subset of rows. Good systems store strong checksums like CRC32 or XXHash and per page metadata so they can detect corruption quickly and fail queries fast or fall back to replicas.
They also isolate failures: corruption in one column chunk does not affect others. This is why columnar formats store metadata at multiple levels. A corrupted page in the email column should not prevent reading the order_amount column from the same row group.
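As a minimal illustration of that isolation, the sketch below stores a CRC32 alongside each page and verifies it on read, so corruption in one column's page raises an error without affecting pages for other columns. The Page structure and helpers are hypothetical simplifications; real formats such as Parquet carry checksums and statistics in page and footer metadata.

```python
# Minimal sketch of per-page checksum verification with a hypothetical
# in-memory page layout; not the on-disk structure of any real format.
import zlib
from dataclasses import dataclass

@dataclass
class Page:
    data: bytes
    crc32: int  # checksum recorded at write time

def write_page(data: bytes) -> Page:
    return Page(data=data, crc32=zlib.crc32(data))

def read_page(page: Page) -> bytes:
    # Fail fast on corruption; the caller can retry or fall back to a replica.
    if zlib.crc32(page.data) != page.crc32:
        raise IOError("page checksum mismatch: column chunk unreadable")
    return page.data

# Corruption in the email column chunk fails on read, but the order_amount
# pages from the same row group remain independently readable.
email_page = write_page(b"alice@example.com,bob@example.com")
amount_page = write_page(b"\x10\x27\x00\x00\x40\x9c\x00\x00")
email_page.data = b"garbled partial write"   # simulate a partial/corrupted write
print(read_page(amount_page))                # still fine
try:
    read_page(email_page)
except IOError as err:
    print(err)
```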
⚠️ Common Pitfall: Teams often underestimate compaction cost when migrating write-heavy workloads to columnar storage. What looks like a 10,000 row per second insert rate can generate 50,000 compaction writes per second in the background, exhausting I/O capacity.
Misaligned Sort and Partition Keys:
Columnar systems rely heavily on clustering similar values together to maximize compression and enable effective data skipping. If you choose a poor sort key, for example a high-cardinality random identifier like a user UUID instead of a timestamp or customer ID, the min and max statistics per row group become nearly useless.
Consider a table with 10 billion events sorted by random UUID. Every row group will have a min UUID near the bottom of the key space and a max near the top, so the UUID statistics narrow nothing down. Worse, because rows land in effectively random order with respect to time, each row group's min and max for event_time also span nearly the entire table, so a filter on WHERE event_time BETWEEN X AND Y cannot prune any row groups using time-based statistics. The theoretical advantage of the columnar layout is not realized, and you end up scanning the entire dataset.
This is a common source of "my warehouse is slow" issues at scale. Switching from UUID sort to timestamp sort can reduce data scanned from 5 TB to 50 GB for time range queries, cutting latency from 60 seconds to 3 seconds.
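You can observe the pruning behavior directly from file-level statistics. The sketch below, which assumes pyarrow and a hypothetical events.parquet file with an event_time column, counts how many row groups a time-range predicate could skip based on per-row-group min and max values; on timestamp-sorted data most groups are skippable, while on UUID-sorted data almost none are.

```python
# Sketch: count Parquet row groups that a time-range filter could skip
# using per-row-group min/max statistics. The file and column names are
# hypothetical; requires pyarrow.
from datetime import datetime
import pyarrow.parquet as pq

def prunable_row_groups(path: str, column: str, lo, hi) -> tuple[int, int]:
    pf = pq.ParquetFile(path)
    col_idx = pf.schema_arrow.get_field_index(column)
    total = pf.metadata.num_row_groups
    skipped = 0
    for i in range(total):
        stats = pf.metadata.row_group(i).column(col_idx).statistics
        # A row group can be skipped only if its value range falls
        # entirely outside [lo, hi].
        if stats is not None and stats.has_min_max:
            if stats.max < lo or stats.min > hi:
                skipped += 1
    return skipped, total

# Sorted by event_time: most groups cover a narrow time range -> many skipped.
# Sorted by random UUID: every group spans nearly the full range -> ~0 skipped.
skipped, total = prunable_row_groups(
    "events.parquet", "event_time",
    datetime(2024, 1, 1), datetime(2024, 1, 2),
)
print(f"can skip {skipped} of {total} row groups")
```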
Data Skew and Hot Keys:
Skewed data distributions create operational challenges. If a small subset of values dominates a column, dictionary or run length encoding may perform extremely well, but hot keys can cause uneven work distribution across nodes during joins and aggregates.
For example, consider a customer events table where one large enterprise customer generates 40 percent of all events. When you join this table with a smaller dimension table on customer ID, the shuffle phase sends 40 percent of the data to a single worker node. That node becomes a bottleneck while the others sit idle, degrading overall query performance. Some systems use dynamic repartitioning or adaptive execution to mitigate this, but as a designer you need to treat data skew as a first-class concern.
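The sketch below simulates that shuffle with a hypothetical distribution: one hot customer producing about 40 percent of events, hash-partitioned across 8 workers. Because every row with the same join key must land on the same worker, the hot key's worker ends up with close to half of all rows while the rest each handle a small share.

```python
# Sketch of how a hot key skews a hash-partitioned shuffle.
# The customer distribution and worker count are illustrative assumptions.
from collections import Counter
import random

random.seed(0)
NUM_WORKERS = 8

# One enterprise customer ("cust_hot") generates ~40% of events;
# the remaining 60% is spread across 1,000 smaller customers.
events = [
    "cust_hot" if random.random() < 0.40 else f"cust_{random.randint(1, 1000)}"
    for _ in range(100_000)
]

# Shuffle for a join on customer_id: each row is routed by hash(key) % workers,
# so every row for the hot key lands on the same worker.
rows_per_worker = Counter(hash(cust) % NUM_WORKERS for cust in events)
for worker, n in sorted(rows_per_worker.items()):
    print(f"worker {worker}: {n:>6,} rows ({n / len(events):.0%})")
# The hot key's worker receives roughly 45-50% of all rows;
# the other seven workers each receive only ~7-8%.
```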
Skew Impact on Query Time: balanced distribution ~8 sec → skewed distribution ~45 sec
💡 Key Takeaways
✓Compaction backlog under high update load can increase query latency from 2 seconds to 15 seconds as the number of micro-partitions grows from 5,000 to 50,000 due to segment fragmentation
✓Poor sort key choice like random UUID instead of timestamp makes min and max statistics useless, increasing data scanned from 50 GB to 5 TB and query time from 3 seconds to 60 seconds
✓Data skew where one customer generates 40 percent of events causes uneven work distribution during joins, degrading query time from 8 seconds balanced to 45 seconds skewed
✓Wide tables with 5,000 columns and 100,000 row groups generate 500 million metadata entries, so metadata loading can become a bottleneck that exceeds the actual scan time
📌 Examples
1. A Redshift deployment saw compaction fall behind during a daily batch load, causing query latency to spike from 3 seconds to 20 seconds until a manual VACUUM completed
2. An analytics table sorted by user UUID required scanning 8 TB for time-range queries, but re-clustering by timestamp reduced scans to 80 GB, a 100x latency improvement