Loading...
Data Warehousing Fundamentals • Columnar Storage InternalsEasy⏱️ ~3 min
What is Columnar Storage?
Definition
Columnar Storage is a database layout where each column of data is stored separately and contiguously, rather than storing complete rows together. Instead of organizing data as entire records side by side, it groups all values from each column together.
order_id and total_amount. But your orders table has 80 columns. With row storage, you must read all 80 columns for every single row, even though you discard 78 of them immediately. At terabyte scale, this wasted I/O becomes catastrophic.
How Columnar Storage Works:
Columnar storage reorganizes data physically on disk. Instead of storing [row 1: all columns], [row 2: all columns], it stores [column A: all rows], [column B: all rows]. When your query needs only 5 columns, the database reads only those 5 column files. This is called column pruning, and it can reduce I/O by 50 to 100 times for wide tables.
But the benefits go deeper. Because all values in a column have the same data type and often similar patterns, compression becomes incredibly effective. Techniques like dictionary encoding and run length encoding can shrink data by 5 to 20 times. A column of repeated country codes compresses beautifully. A column mixing random data types does not.
✓ In Practice: Nearly every modern data warehouse uses columnar storage: Snowflake, BigQuery, Redshift, and even data lake query engines like Presto and Spark default to columnar formats such as Parquet or ORC.
The Key Insight:
Columnar storage trades write complexity for read performance. Writing a single row requires touching multiple column files instead of appending to one place. But for analytics workloads where each query scans millions of rows and you run far more reads than writes, this tradeoff delivers 10 to 50 times better performance.💡 Key Takeaways
✓Columnar storage organizes data by column instead of by row, storing all values for each column together on disk
✓Column pruning allows queries to read only the columns they need, reducing I/O by 50 to 100 times for wide tables with many unused columns
✓Compression is 5 to 20 times more effective because values in a single column have the same type and similar patterns, enabling dictionary encoding and run length encoding
✓Analytics queries that scan billions of rows but touch few columns see 10 to 50 times better performance compared to row oriented storage at the same scale
📌 Examples
1A query selecting AVG(order_amount) from 1 billion orders with 80 columns only reads the order_amount column file, avoiding 98.75% of unnecessary I/O
2Snowflake, BigQuery, Redshift, and Apache Parquet all use columnar storage internally for their analytics workloads
Loading...