
Columnar Storage Internal Architecture

The Physical Layout: Columnar storage follows a hierarchical structure designed to balance sequential read efficiency with the ability to skip irrelevant data. At the top level you have a file or segment, typically 100 MB to 1 GB uncompressed. The file divides horizontally into row groups, usually 16 to 512 MB each. Each row group contains column chunks (one per column), which split further into pages, often 8 KB to 1 MB. Pages are the fundamental unit: the smallest piece the database reads from disk and the unit where encoding and compression happen. This multi-level hierarchy matters because it creates an opportunity to skip work at every level, as the code sketch after the list below illustrates:
1. File level: the query planner identifies which files to read based on partition keys such as event date or customer ID.
2. Row group level: per-column statistics such as min and max values allow skipping entire row groups that cannot match the filter predicates.
3. Column chunk level: only the column chunks the query actually needs are read from disk.
4. Page level: decompression and decoding happen on these small units, enabling efficient use of the CPU cache.
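The same hierarchy is easy to see in Parquet, the most common open columnar format. Here is a minimal sketch using pyarrow; the file name, column names, and row-group size are illustrative assumptions, not anything the lesson prescribes. It writes a small file with explicit row groups, then walks the metadata from file to row group to column chunk.

# Sketch: write a Parquet file with explicit row groups, then inspect the
# file -> row group -> column chunk hierarchy. Names and sizes are made up.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": list(range(1_000_000)),
    "country": ["US", "DE", "JP", "BR"] * 250_000,   # low-cardinality column
})
pq.write_table(
    table, "orders.parquet",
    row_group_size=250_000,    # rows per row group -> 4 row groups here
    use_dictionary=True,       # dictionary-encode low-cardinality columns
    compression="zstd",        # general-purpose compression on encoded pages
)

meta = pq.ParquetFile("orders.parquet").metadata
print(f"row groups: {meta.num_row_groups}, columns: {meta.num_columns}")
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)   # one column chunk per column per row group
        print(rg, chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size, chunk.total_uncompressed_size)

Pages sit one level further down; pyarrow only exposes them indirectly (for example through each chunk's page offsets and total sizes), which is why the walk above stops at column chunks.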
Encoding and Compression: Within each page, the system applies encoding schemes tuned to the column's type and value distribution. For low-cardinality strings like country codes or product categories, dictionary encoding maps values to small integer codes: if a column has just 50 distinct countries across 1 billion rows, you store a 50-entry dictionary and 1 billion integer codes instead of 1 billion full strings. For repeated values, run-length encoding stores (value, count) pairs, so a column with long runs of the same status, like "pending, pending, pending," becomes "pending: 3." For narrow integer ranges, bit packing uses a fixed low bit width; if your integers fit in 12 bits instead of 32, you save 62.5% of the space. After encoding, a general-purpose compression algorithm like Snappy, LZ4, ZSTD, or Gzip operates on the encoded bytes. These algorithms trade CPU cost against compression ratio: Snappy is fast with moderate compression, while ZSTD achieves better ratios at higher CPU cost.
Typical Compression Gains
- Size reduction: 5x to 20x
- Page size: 8 KB to 1 MB
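Putting rough numbers on the dictionary-plus-bit-packing example above makes the gain concrete. This is back-of-the-envelope arithmetic only; the 7-byte average string length is an assumed figure for illustration.

# Rough sizing of the country column: 1 billion rows, 50 distinct values.
import math

rows = 1_000_000_000
distinct = 50
avg_string_bytes = 7                                  # assumed average, e.g. "Germany"

raw_bytes = rows * avg_string_bytes                   # store every string in full
code_bits = math.ceil(math.log2(distinct))            # 6 bits are enough for 50 codes
dict_bytes = distinct * avg_string_bytes              # the 50-entry dictionary
encoded_bytes = dict_bytes + rows * code_bits / 8     # dictionary + bit-packed codes

print(f"raw:     {raw_bytes / 1e9:.1f} GB")           # 7.0 GB
print(f"encoded: {encoded_bytes / 1e9:.2f} GB")       # 0.75 GB
print(f"ratio:   {raw_bytes / encoded_bytes:.1f}x")   # ~9x before Snappy or ZSTD

General-purpose compression then runs on top of these already-encoded bytes, which is how the combined reduction lands in the 5x to 20x range quoted above.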
Metadata for Data Skipping: Each row group and column page stores critical metadata: min and max values, null counts, sometimes distinct counts, and optional Bloom filters. Systems like Snowflake, Redshift, and BigQuery use this metadata for aggressive data skipping. If your query filters on WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31' and a row group's metadata shows its minimum date is 2024-05-01, the entire row group is skipped without reading a single byte. Snowflake micro-partitions, roughly 16 MB each compressed, carry rich statistics that allow skipping 80 to 98 percent of partitions for common time-range and customer filters. This turns terabyte scans into hundred-gigabyte scans at the I/O level.
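The skip decision itself is simple range-overlap logic against the stored min and max values. The sketch below spells it out manually with pyarrow against an assumed "orders.parquet" file that has an order_date column; engines like Snowflake do this pruning internally, and the exact Python type of the decoded min/max depends on the column's logical type.

# Sketch: prune row groups whose min/max date range cannot overlap the filter.
import datetime as dt
import pyarrow.parquet as pq

lo, hi = dt.date(2024, 1, 1), dt.date(2024, 1, 31)    # WHERE order_date BETWEEN lo AND hi

pf = pq.ParquetFile("orders.parquet")                 # assumed file and schema
date_idx = pf.schema_arrow.get_field_index("order_date")

keep = []
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(date_idx).statistics
    if stats is None or not stats.has_min_max:
        keep.append(rg)                               # no stats -> cannot skip safely
    elif stats.max >= lo and stats.min <= hi:
        keep.append(rg)                               # ranges overlap -> must read
    # else: skip this row group without reading a single data byte

table = pf.read_row_groups(keep, columns=["order_id", "order_date"])

Readers normally do not hand-roll this: passing a predicate to the reader (for example pyarrow's filters argument) applies the same min/max pruning automatically, but writing it out shows exactly what the metadata buys you.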
💡 Key Takeaways
- Files are organized into row groups of 16 to 512 MB, each split into column chunks, which divide into pages of 8 KB to 1 MB as the unit of I/O and compression.
- Dictionary encoding, run-length encoding, and bit packing reduce data size by 5 to 20 times before general-purpose compression like Snappy or ZSTD is applied.
- Metadata including per-row-group min and max values enables data skipping: Snowflake skips 80 to 98 percent of micro-partitions for typical time-range filters.
- Pages as the compression unit balance sequential read performance with memory efficiency: small enough to fit in CPU cache but large enough to compress well.
📌 Examples
1. A WHERE order_date = '2024-06-15' filter skips row groups whose min date is 2024-01-01 and max date is 2024-03-31, without reading any data.
2. A country column with 50 distinct values across 1 billion rows uses dictionary encoding: 50 string entries plus 1 billion small integer codes instead of 1 billion full strings.