How Column-Oriented Storage Transforms Analytical Query Performance
Why Column Storage Matters
When aggregating revenue across 1 billion transactions, a row store reads every field (customer name, address, notes) even though only revenue is needed. At 500 bytes per row, that is 500 GB from disk. A columnar store reads only the revenue column: 8 bytes times 1 billion rows equals 8 GB, a roughly 60x reduction in I/O before compression even applies.
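The arithmetic above can be checked directly. A minimal sketch (the row width, column width, and row count are the figures from the text, not measurements):

```python
# Back-of-the-envelope I/O comparison using the figures from the text.
ROWS = 1_000_000_000
ROW_BYTES = 500   # full record: customer name, address, notes, revenue, ...
COL_BYTES = 8     # a single 64-bit revenue value

row_store_gb = ROWS * ROW_BYTES / 1e9      # row store reads every field
column_store_gb = ROWS * COL_BYTES / 1e9   # columnar store reads one column

print(f"row store:    {row_store_gb:.0f} GB")    # 500 GB
print(f"column store: {column_store_gb:.0f} GB") # 8 GB
print(f"reduction:    {row_store_gb / column_store_gb:.1f}x")
```

The exact ratio is 62.5x; the text rounds it to 60x.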
Compression Techniques
Adjacent column values share data types and patterns, enabling powerful compression. Dictionary encoding replaces repeated strings (like "pending", "shipped", "delivered") with small integers, storing the dictionary once. Run-length encoding represents consecutive identical values as (value, count) pairs: 1000 rows of "pending" become a single (pending, 1000). Delta encoding stores the small differences between adjacent sorted values instead of the values themselves (sequential timestamps a few seconds apart compress to tiny deltas rather than full 8-byte integers). Combined, these achieve 3-10x compression ratios on typical event data.
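A toy sketch of two of these encodings, dictionary followed by run-length, using the status values from the text (real engines implement these over binary block formats, not Python lists):

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace repeated strings with small integer codes; store the dictionary once."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive identical values into (value, count) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

statuses = ["pending"] * 1000 + ["shipped"] * 400 + ["delivered"] * 600
dictionary, codes = dictionary_encode(statuses)
runs = run_length_encode(codes)
# 2000 strings become a 3-entry dictionary plus 3 (code, count) pairs:
# runs == [(0, 1000), (1, 400), (2, 600)]
```

Stacking the encodings is the point: dictionary encoding turns wide strings into dense integers, and run-length encoding then exploits how those integers cluster in sorted or append-ordered data.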
Zone Maps and Pruning
Columnar stores maintain zone maps (min/max metadata per data block). A query filtering amount > 1000 skips blocks where max(amount) < 1000. With partitioning (grouping data by date), a query on the last 7 days can skip over 99% of a 2-year dataset instantly. These pruning techniques reduce scanned bytes by 10-100x beyond column selection.
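A minimal zone-map sketch, assuming an in-memory column split into fixed-size blocks (the block size and sample amounts are illustrative; real engines keep min/max in file footers and skip disk reads entirely):

```python
BLOCK_SIZE = 4  # real systems use thousands to millions of rows per block

def build_zone_maps(column, block_size=BLOCK_SIZE):
    """Split a column into blocks and record (min, max) metadata for each."""
    blocks = [column[i:i + block_size] for i in range(0, len(column), block_size)]
    return [(min(b), max(b), b) for b in blocks]

def scan_greater_than(zone_maps, threshold):
    """Evaluate `value > threshold`, reading only blocks the metadata can't rule out."""
    matches, blocks_read = [], 0
    for lo, hi, block in zone_maps:
        if hi <= threshold:
            continue  # pruned: the whole block fails the filter, never read it
        blocks_read += 1
        matches.extend(v for v in block if v > threshold)
    return matches, blocks_read

amounts = [10, 20, 30, 40, 900, 950, 1200, 1500, 5, 15, 25, 35]
hits, read = scan_greater_than(build_zone_maps(amounts), 1000)
# hits == [1200, 1500]; only 1 of 3 blocks is actually read
```

Note the asymmetry: zone maps can prove a block is irrelevant, but a surviving block must still be scanned value by value, since its range only overlaps the predicate.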
Trade-offs
The cost is write performance and point lookups. Inserting one row writes to every column file separately. Fetching a complete record gathers data from many column segments. This makes columnar stores poor for OLTP (Online Transaction Processing: high-frequency single-row operations like bank transfers) but exceptional for OLAP (Online Analytical Processing: aggregations over large datasets like "total revenue by region").
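The trade-off shows up even in a toy in-memory model, a hypothetical sketch (real engines write immutable column files, not Python lists): inserting one logical row touches every column structure, a point lookup reassembles the row from all of them, while an OLAP-style aggregate touches just one column.

```python
class ColumnarTable:
    """Toy columnar layout: one list per column instead of one record per row."""

    def __init__(self, columns):
        self.data = {name: [] for name in columns}

    def insert(self, row):
        # One logical row -> one append per column ("writes to every column file")
        for name, values in self.data.items():
            values.append(row[name])

    def point_lookup(self, idx):
        # Reassembling a full record gathers from every column segment
        return {name: values[idx] for name, values in self.data.items()}

    def sum_column(self, name):
        # The OLAP sweet spot: an aggregate reads exactly one column
        return sum(self.data[name])

t = ColumnarTable(["region", "revenue"])
t.insert({"region": "EU", "revenue": 120})
t.insert({"region": "US", "revenue": 340})
# t.point_lookup(1) -> {"region": "US", "revenue": 340}  (touches both columns)
# t.sum_column("revenue") -> 460                          (touches one column)
```

With two columns the difference is invisible; with hundreds of columns, the per-row fan-out on insert and lookup is exactly why OLTP workloads prefer row stores.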