
How Column-Oriented Storage Transforms Analytical Query Performance

Column-oriented databases flip traditional row storage on its head: instead of grouping all the fields of each row together, they store all the values for each column together. When an analytical query needs to aggregate revenue across 1 billion transactions, a row store must read every field of every row even though you only care about one column. A column store reads just the revenue column, cutting I/O by roughly 20x when the table has 20 columns.

This layout also unlocks aggressive compression, because consecutive values in a column share the same type and often exhibit patterns. A status column with only 5 distinct values across millions of rows compresses dramatically with dictionary encoding (map "pending" to integer 1, store millions of 1s, then run-length encode the result). Real-world compression typically achieves a 3 to 10x reduction, with low-cardinality columns hitting even higher ratios. Combined with column pruning, you might scan 100x fewer bytes than a naive row store would.

Modern column stores execute queries through vectorized operators that process thousands of values per CPU instruction, plus late materialization that defers assembling full rows until absolutely necessary. They push predicates down to storage, skip entire data blocks using min/max zone maps, and prune partitions before reading any data. A BigQuery query filtering on date and user_id might eliminate 95% of the data through partition pruning and clustering before scanning a single byte.

The tradeoff is write complexity. Columnar segments are immutable, so updates and deletes require writing delta files or rewriting entire segments. A workload with frequent single-row updates suffers high write amplification and requires constant compaction. This architecture shines for append-heavy analytics with batch ingestion, not transactional systems requiring millisecond updates.
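To make the compression claim concrete, here is a minimal sketch of dictionary encoding followed by run-length encoding on a hypothetical low-cardinality status column. Real engines encode per segment and pack codes into bits, so treat this as an illustration of the idea rather than any engine's actual format:

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer code (dictionary encoding)."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of identical codes into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]

# Hypothetical status column: a million rows, only three distinct values.
status_column = ["pending"] * 600_000 + ["shipped"] * 350_000 + ["refunded"] * 50_000

dictionary, codes = dictionary_encode(status_column)
runs = run_length_encode(codes)

print(dictionary)  # {'pending': 0, 'shipped': 1, 'refunded': 2}
print(runs)        # [(0, 600000), (1, 350000), (2, 50000)]
# 1,000,000 string values collapse to a 3-entry dictionary plus 3 (code, count) pairs.
```

The same column stored row-wise would interleave these strings with every other field, leaving nothing for the run-length encoder to exploit.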
💡 Key Takeaways
Column pruning eliminates reading unnecessary fields. A query selecting 5 of 100 columns cuts I/O by 20x before compression even applies.
Compression achieves a 3 to 10x reduction on typical event data. Low-cardinality columns like status or category compress even further with dictionary and run-length encoding.
Vectorized execution processes thousands of values per CPU instruction. Combined with late materialization, queries avoid assembling full rows until the final output.
Zone maps and partition pruning skip entire data blocks. A BigQuery query over 10 TB might actually read only 500 GB after eliminating 95% through partition pruning and clustering (see the zone-map sketch after this list).
Write amplification makes updates expensive. A single-row update requires rewriting an entire columnar segment or maintaining delta files that must be compacted later.
Latency targets are seconds, not milliseconds. Interactive queries over billions of rows return in single-digit to tens of seconds, not sub-100 ms like OLTP systems.
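The block-skipping idea behind zone maps is simple enough to sketch. The snippet below is an illustrative toy, not BigQuery's or Redshift's actual storage layer: each block keeps min/max statistics for a column, so a range filter can discard blocks without ever reading their data.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A columnar block with per-column min/max stats (one zone-map entry)."""
    min_date: str
    max_date: str
    rows: list  # row payloads; only read if the block might match

def scan_with_zone_maps(blocks, start_date, end_date):
    """Skip any block whose [min_date, max_date] range cannot overlap the filter."""
    matched, skipped = [], 0
    for block in blocks:
        if block.max_date < start_date or block.min_date > end_date:
            skipped += 1  # pruned using stats only; block data never read
            continue
        matched.extend(r for r in block.rows if start_date <= r["date"] <= end_date)
    return matched, skipped

# Hypothetical blocks covering disjoint date ranges (ISO dates compare lexicographically).
blocks = [
    Block("2024-01-01", "2024-01-31", [{"date": "2024-01-15", "revenue": 10}]),
    Block("2024-02-01", "2024-02-29", [{"date": "2024-02-10", "revenue": 20}]),
    Block("2024-03-01", "2024-03-31", [{"date": "2024-03-05", "revenue": 30}]),
]

rows, skipped = scan_with_zone_maps(blocks, "2024-03-01", "2024-03-31")
print(rows, f"blocks skipped: {skipped}")  # only the March block is actually read
```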
📌 Examples
Netflix queries petabyte-scale Parquet datasets via distributed SQL engines. With effective partitioning and predicate pushdown, TB-scale queries complete in seconds to minutes, achieving 10x+ compression over raw JSON.
A BigQuery query filters 1 billion transactions by date range and status. Partition pruning on date eliminates 90% of the data, and clustering on status skips another 80% of the remaining blocks. The final scan reads 2 GB instead of 100 GB, completing in 3 seconds at about $0.01 instead of 30 seconds at $0.50 (the dry-run sketch below shows how to check bytes scanned before running such a query).
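To see how much pruning a filter buys before paying for the scan, BigQuery's dry-run mode reports the estimated bytes processed without executing the query. A minimal sketch using the google-cloud-bigquery client library; the project, dataset, table, and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table partitioned by transaction_date and clustered by status.
sql = """
    SELECT SUM(revenue) AS total_revenue
    FROM `my-project.sales.transactions`
    WHERE transaction_date BETWEEN '2024-03-01' AND '2024-03-31'
      AND status = 'shipped'
"""

# dry_run estimates bytes processed without running (or billing for) the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"Estimated bytes processed: {job.total_bytes_processed:,}")
# Compare against the same query without the date and status predicates to see
# how much data partition pruning and clustering eliminate before the scan.
```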