Loading...
Data Warehousing FundamentalsColumnar Storage InternalsEasy⏱️ ~3 min

What is Columnar Storage?

Definition
Columnar Storage is a database layout where each column of data is stored separately and contiguously, rather than storing complete rows together. Instead of organizing data as entire records side by side, it groups all values from each column together.
The Core Problem: Traditional row oriented databases store data the way you think about it: one complete record after another. When you fetch a user profile with 50 fields, reading one row makes sense. But modern analytics queries have a completely different pattern. They scan billions of rows yet only need 5 columns out of 300, and they almost always aggregate the results. Imagine a query that calculates average order value across 1 billion orders. The query needs just two columns: order_id and total_amount. But your orders table has 80 columns. With row storage, you must read all 80 columns for every single row, even though you discard 78 of them immediately. At terabyte scale, this wasted I/O becomes catastrophic. How Columnar Storage Works: Columnar storage reorganizes data physically on disk. Instead of storing [row 1: all columns], [row 2: all columns], it stores [column A: all rows], [column B: all rows]. When your query needs only 5 columns, the database reads only those 5 column files. This is called column pruning, and it can reduce I/O by 50 to 100 times for wide tables. But the benefits go deeper. Because all values in a column have the same data type and often similar patterns, compression becomes incredibly effective. Techniques like dictionary encoding and run length encoding can shrink data by 5 to 20 times. A column of repeated country codes compresses beautifully. A column mixing random data types does not.
✓ In Practice: Nearly every modern data warehouse uses columnar storage: Snowflake, BigQuery, Redshift, and even data lake query engines like Presto and Spark default to columnar formats such as Parquet or ORC.
The Key Insight: Columnar storage trades write complexity for read performance. Writing a single row requires touching multiple column files instead of appending to one place. But for analytics workloads where each query scans millions of rows and you run far more reads than writes, this tradeoff delivers 10 to 50 times better performance.
💡 Key Takeaways
Columnar storage organizes data by column instead of by row, storing all values for each column together on disk
Column pruning allows queries to read only the columns they need, reducing I/O by 50 to 100 times for wide tables with many unused columns
Compression is 5 to 20 times more effective because values in a single column have the same type and similar patterns, enabling dictionary encoding and run length encoding
Analytics queries that scan billions of rows but touch few columns see 10 to 50 times better performance compared to row oriented storage at the same scale
📌 Examples
1A query selecting AVG(order_amount) from 1 billion orders with 80 columns only reads the order_amount column file, avoiding 98.75% of unnecessary I/O
2Snowflake, BigQuery, Redshift, and Apache Parquet all use columnar storage internally for their analytics workloads
← Back to Columnar Storage Internals Overview
Loading...