
What is Columnar Storage and Why Does It Transform Analytical Performance?

Columnar storage physically organizes data by column rather than by row. Instead of storing entire records together (row 1: name, age, country; row 2: name, age, country), it groups all values of each column contiguously (all names together, all ages together, all countries together). This seemingly simple change delivers massive performance gains for analytical workloads.

Consider a typical analytics query that scans billions of rows but needs only 3 columns out of 50. Row storage forces you to read all 50 columns' worth of data from disk. Columnar storage reads only those 3 columns, reducing input/output (I/O) by roughly 94% in this example (assuming similarly sized columns). At Amazon Redshift scale, this translates to scanning gigabytes instead of terabytes for the same logical query.

The performance win multiplies through compression. Values in a single column share the same data type and often exhibit patterns: repeated categories (countries like "US" and "UK"), slowly incrementing timestamps, or a limited number of distinct values. Google BigQuery's Capacitor format routinely achieves 5 to 10 times compression on web analytics columns through techniques like dictionary encoding (replacing "United States" with the integer id 5 everywhere it appears) and run-length encoding (RLE), which stores 10,000 consecutive occurrences of "US" as a single entry.

Production systems combine both benefits. AWS Athena customers moving from CSV to Parquet commonly see a 70% to 95% reduction in data scanned, cutting both query time and cost proportionally. A 1 terabyte per day landing zone shrinks to 100 to 300 gigabytes, and queries that took minutes complete in seconds because you are reading one tenth of the bytes, and those bytes decompress into the original data.
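The column-pruning behavior described above is visible at the API level. The sketch below uses pyarrow with a small synthetic table (the column names, data sizes, and file name are illustrative assumptions, not from the article): the Parquet writer applies per-column encodings and compression, and the reader materializes only the columns a query asks for, skipping the rest on disk.

```python
# Minimal sketch, assuming pyarrow is installed; names and data are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small "wide" table in memory (stand-in for a many-column fact table).
n = 100_000
table = pa.table({
    "user_id": pa.array(range(n)),
    "country": pa.array(["US", "UK", "DE", "US", "US"] * (n // 5)),
    "ts": pa.array(range(n)),
    "payload": pa.array(["x" * 40] * n),  # a column the query never touches
})

# Parquet applies per-column encodings (dictionary encoding and snappy
# compression are pyarrow's defaults for string-heavy columns).
pq.write_table(table, "events.parquet", compression="snappy")

# Column pruning: request only the columns the query needs; the "payload"
# column chunks are never read from disk.
subset = pq.read_table("events.parquet", columns=["user_id", "country", "ts"])
print(subset.num_rows, subset.column_names)
```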
💡 Key Takeaways
Columnar storage groups data by column instead of row, enabling queries to read only needed columns and skip irrelevant data entirely
I/O reduction is dramatic: scanning 3 columns out of 50 reads 94% less data, turning terabyte scans into gigabyte scans in systems like Amazon Redshift
Compression ratios of 3 to 10 times are typical because column values share types and patterns (repeated categories, sorted timestamps, limited distinct values)
Real production impact: AWS Athena users see 70% to 95% reduction in scanned data when converting CSV to Parquet, cutting cost and latency proportionally
Google BigQuery achieves 5 to 10 times compression on web analytics columns using dictionary encoding and run-length encoding (RLE) tailored to each column's characteristics (see the toy sketch after this list)
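The two encodings named above are simple enough to sketch directly. The toy Python below is illustrative only (the helper names and the sample column are made up), but it shows how dictionary encoding shrinks repeated strings to small integer ids and how run-length encoding collapses consecutive repeats into (value, count) pairs:

```python
# Toy sketch of dictionary and run-length encoding; real columnar formats
# implement these per column chunk or page, not per whole column.

def dictionary_encode(values):
    """Replace each distinct string with a small integer id."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

column = ["United States"] * 10_000 + ["United Kingdom"] * 2_000
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary)  # {'United States': 0, 'United Kingdom': 1}
print(runs)        # [[0, 10000], [1, 2000]]
```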
📌 Examples
Amazon Redshift: typical deployments achieve 3 to 6 times compression through per-column encodings, scanning billions of rows in seconds for star-schema aggregations when sort keys align with filters
AWS Athena charges per terabyte scanned: converting 1 terabyte per day of CSV logs to Parquet shrinks the data to 100 to 300 gigabytes per day (3 to 10 times compression), cutting monthly scan costs from roughly $5,000 to between $500 and $1,500
Google BigQuery: queries scanning 10 to 100 terabytes complete in seconds when predicates prune aggressively, enabled by per-page min/max statistics and column pruning in the Capacitor format (the Parquet sketch below shows the same pruning mechanics)
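Capacitor itself is proprietary, but the same mechanics are observable in open formats via pyarrow and Parquet. The sketch below is illustrative (file name, column names, and sizes are assumptions): the reader combines column pruning with a filter that is checked against per-row-group min/max statistics, so row groups that cannot contain matching rows are skipped.

```python
# Minimal sketch of column pruning plus predicate pushdown with pyarrow;
# names and sizes are illustrative, not from the article.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a table sorted by day so each row group covers a narrow day range,
# which makes the per-row-group min/max statistics selective.
n = 1_000_000
table = pa.table({
    "day": pa.array(sorted(i % 365 for i in range(n))),
    "clicks": pa.array(range(n)),
})
pq.write_table(table, "facts.parquet", row_group_size=100_000)

# The reader compares the filter against row-group min/max statistics and
# skips row groups that cannot match; only the listed columns are read.
recent = pq.read_table("facts.parquet",
                       columns=["day", "clicks"],     # column pruning
                       filters=[("day", ">=", 360)])  # predicate pushdown
print(recent.num_rows)

# The statistics the pruning relies on are visible in the file metadata.
md = pq.ParquetFile("facts.parquet").metadata
stats = md.row_group(0).column(0).statistics
print(stats.min, stats.max)
```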