
How Column-Oriented Storage Transforms Analytical Query Performance

Column-oriented databases flip traditional row storage on its head: instead of grouping all the fields of each row together, they store all the values for each column together. When an analytical query needs to aggregate revenue across 1 billion transactions, a row store must read every field of every row even though you only care about one column. A column store reads just the revenue column, cutting I/O by roughly 20x when the table has 20 columns.

This layout also unlocks aggressive compression, because consecutive values in a column share the same type and often exhibit patterns. A status column with only 5 distinct values across millions of rows compresses dramatically with dictionary encoding (map "pending" to integer 1, store millions of 1s, then run-length encode the result). Real-world compression typically achieves a 3 to 10x reduction, with low-cardinality columns hitting even higher ratios. Combined with column pruning, you might scan 100x fewer bytes than a naive row store would.

Modern column stores execute queries through vectorized operators that process thousands of values per CPU instruction, plus late materialization that defers assembling full rows until absolutely necessary. They push predicates down to storage, skip entire data blocks using min/max zone maps, and prune partitions before reading any data. A BigQuery query filtering on date and user_id might eliminate 95% of the data through partition pruning and clustering before scanning a single byte.

The tradeoff is write complexity. Columnar segments are immutable, so updates and deletes require writing delta files or rewriting entire segments. A workload with frequent single-row updates suffers high write amplification and requires constant compaction. This architecture shines for append-heavy analytics with batch ingestion, not transactional systems requiring millisecond updates.
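To make the compression claim concrete, here is a minimal sketch of dictionary encoding followed by run-length encoding on a hypothetical low-cardinality status column. Real engines encode per segment and pack codes into bits, so treat this as an illustration of the idea rather than any engine's actual format:

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer code (dictionary encoding)."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of identical codes into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]

# Hypothetical status column: a million rows, only three distinct values.
status_column = ["pending"] * 600_000 + ["shipped"] * 350_000 + ["refunded"] * 50_000

dictionary, codes = dictionary_encode(status_column)
runs = run_length_encode(codes)

print(dictionary)  # {'pending': 0, 'shipped': 1, 'refunded': 2}
print(runs)        # [(0, 600000), (1, 350000), (2, 50000)]
# 1,000,000 string values collapse to a 3-entry dictionary plus 3 (code, count) pairs.
```

The same column stored row-wise would interleave these strings with every other field, leaving nothing for the run-length encoder to exploit.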
💡 Key Takeaways
Column pruning eliminates reading unnecessary fields. A query selecting 5 of 100 columns cuts I/O by 20x before compression even applies.
Compression achieves a 3 to 10x reduction on typical event data. Low-cardinality columns like status or category compress even further with dictionary and run-length encoding.
Vectorized execution processes thousands of values per CPU instruction. Combined with late materialization, queries avoid assembling full rows until the final output.
Zone maps and partition pruning skip entire data blocks. A BigQuery query over 10 TB might actually read only 500 GB after eliminating 95% through partition pruning and clustering (see the zone-map sketch after this list).
Write amplification makes updates expensive. A single-row update requires rewriting an entire columnar segment or maintaining delta files that must be compacted later.
Latency targets are seconds, not milliseconds. Interactive queries over billions of rows return in single-digit to tens of seconds, not sub-100 ms like OLTP systems.
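The block-skipping idea behind zone maps is simple enough to sketch. The snippet below is an illustrative toy, not BigQuery's or Redshift's actual storage layer: each block keeps min/max statistics for a column, so a range filter can discard blocks without ever reading their data.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A columnar block with per-column min/max stats (one zone-map entry)."""
    min_date: str
    max_date: str
    rows: list  # row payloads; only read if the block might match

def scan_with_zone_maps(blocks, start_date, end_date):
    """Skip any block whose [min_date, max_date] range cannot overlap the filter."""
    matched, skipped = [], 0
    for block in blocks:
        if block.max_date < start_date or block.min_date > end_date:
            skipped += 1  # pruned using stats only; block data never read
            continue
        matched.extend(r for r in block.rows if start_date <= r["date"] <= end_date)
    return matched, skipped

# Hypothetical blocks covering disjoint date ranges (ISO dates compare lexicographically).
blocks = [
    Block("2024-01-01", "2024-01-31", [{"date": "2024-01-15", "revenue": 10}]),
    Block("2024-02-01", "2024-02-29", [{"date": "2024-02-10", "revenue": 20}]),
    Block("2024-03-01", "2024-03-31", [{"date": "2024-03-05", "revenue": 30}]),
]

rows, skipped = scan_with_zone_maps(blocks, "2024-03-01", "2024-03-31")
print(rows, f"blocks skipped: {skipped}")  # only the March block is actually read
```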
📌 Examples
Netflix queries petabyte-scale Parquet datasets via distributed SQL engines. With effective partitioning and predicate pushdown, TB-scale queries complete in seconds to minutes, achieving 10x+ compression over raw JSON.
A BigQuery query filters 1 billion transactions by date range and status. Partition pruning on date eliminates 90% of the data, and clustering on status skips another 80% of the remaining blocks. The final scan reads 2 GB instead of 100 GB, completing in 3 seconds at about $0.01 instead of 30 seconds at $0.50 (the dry-run sketch below shows how to check bytes scanned before running such a query).
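To see how much pruning a filter buys before paying for the scan, BigQuery's dry-run mode reports the estimated bytes processed without executing the query. A minimal sketch using the google-cloud-bigquery client library; the project, dataset, table, and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table partitioned by transaction_date and clustered by status.
sql = """
    SELECT SUM(revenue) AS total_revenue
    FROM `my-project.sales.transactions`
    WHERE transaction_date BETWEEN '2024-03-01' AND '2024-03-31'
      AND status = 'shipped'
"""

# dry_run estimates bytes processed without running (or billing for) the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"Estimated bytes processed: {job.total_bytes_processed:,}")
# Compare against the same query without the date and status predicates to see
# how much data partition pruning and clustering eliminate before the scan.
```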