
Parquet File Structure and Metadata

The Hierarchical Layout: A Parquet file organizes data into three nested levels: row groups, column chunks, and pages. This structure is key to understanding how Parquet achieves both efficient compression and fast selective scans. At the top level, a file contains one or more row groups. Each row group is a horizontal partition of rows, typically 64 to 512 MB uncompressed, containing 1 to 10 million rows. Think of a row group as a logical batch. Within each row group, data is physically split by column: each column gets its own column chunk, which is a contiguous byte range holding all values for that column in that row group. Column chunks are further divided into pages, usually 1 to 1.5 MB each. Pages are the atomic unit of encoding and compression. When you write a Parquet file, you encode and compress each page independently, choosing the best encoding for that column's data distribution.
1. Row Groups: Horizontal partitions of 64 to 512 MB each, enabling parallel reads and fine-grained row-group skipping based on statistics.
2. Column Chunks: One per column per row group, storing all values for that column together to enable columnar scans.
3. Pages: Atomic units of 1 to 1.5 MB where encoding and compression happen, allowing page-level skipping within a column chunk.
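This hierarchy is easy to see from code. Below is a minimal sketch using pyarrow; the file name, table contents, and sizes are illustrative assumptions, not values from the text. It writes a small file with explicit row-group and page-size limits, then walks the row group and column chunk metadata.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table; real row groups would hold millions of rows (64 to 512 MB).
table = pa.table({
    "event_time": pa.array(range(1_000_000), type=pa.int64()),
    "country": pa.array(["US", "DE", "IN", "BR"] * 250_000),
})

# row_group_size caps rows per row group; data_page_size caps the approximate
# size of each page written inside a column chunk (~1 MB here).
pq.write_table(
    table,
    "events.parquet",
    row_group_size=250_000,
    data_page_size=1 << 20,
)

meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_row_groups)                 # 4 row groups
rg = meta.row_group(0)                     # first horizontal partition
print(rg.num_rows, rg.total_byte_size)     # rows and bytes in this row group
chunk = rg.column(1)                       # column chunk for "country"
print(chunk.path_in_schema, chunk.total_compressed_size)
```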
The Magic of Metadata: Parquet files end with a footer that stores rich metadata. When a query engine opens a Parquet file, it reads only the footer first, which is typically a few hundred kilobytes even for multi-gigabyte files. The footer contains the file schema, a list of all row groups, and for each row group the byte offset, size, and statistics for every column chunk. These statistics include min, max, null count, and sometimes distinct count per column per row group. Query engines exploit this metadata for predicate pushdown. If your query filters on event_time between two dates, the engine checks each row group's min and max timestamps. Any row group whose range falls entirely outside the filter can be skipped without reading a single byte of actual data. In practice, this can eliminate 70 to 90 percent of row groups for time-range queries on partitioned data, turning a 100 TB scan into a 10 to 30 TB scan before even touching disk.
Predicate Pushdown Impact: without statistics, roughly 100 TB scanned; with row-group statistics, only 10 to 30 TB.
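Here is a minimal sketch of that pruning logic, assuming pyarrow and the hypothetical events.parquet file from the earlier sketch: open the file (which reads only the footer), compare each row group's min/max statistics for event_time against the filter, and fetch just the surviving row groups and columns.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")   # opening reads only the footer
meta = pf.metadata

lo, hi = 100_000, 200_000               # filter: event_time BETWEEN lo AND hi

keep = []
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        if col.path_in_schema != "event_time":
            continue
        stats = col.statistics
        # Keep the row group unless its [min, max] range misses the filter entirely.
        if stats is None or not (stats.max < lo or stats.min > hi):
            keep.append(i)
        break

# Fetch only the surviving row groups, and only the columns the query needs.
result = pf.read_row_groups(keep, columns=["event_time", "country"])
print(len(keep), "of", meta.num_row_groups, "row groups read;", result.num_rows, "rows")
```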
Encoding and Compression: Each page specifies its encoding type, number of values, and compressed and uncompressed sizes. Low-cardinality columns like country might use dictionary encoding, where unique values are stored once and referenced by small integer IDs. Sorted integer columns like event_time might use delta encoding, storing only differences between consecutive values. After encoding, pages are compressed with codecs like Snappy (fast decompression) or Zstandard (better compression ratio). This flexibility means Parquet adapts to your data. A string column with millions of unique values uses plain encoding, while a boolean column uses bit packing.
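As a sketch, again assuming pyarrow and the illustrative table from the first example, the writer lets you pick a compression codec per column and restrict dictionary encoding to low-cardinality columns; the footer then records which codec and encodings each column chunk actually used.

```python
import pyarrow.parquet as pq

pq.write_table(
    table,                               # table from the earlier sketch
    "events_tuned.parquet",
    # Zstandard for a better ratio on the integer column, Snappy for fast
    # decompression on the string column.
    compression={"event_time": "zstd", "country": "snappy"},
    # Dictionary-encode only the low-cardinality country column; other columns
    # fall back to non-dictionary encodings.
    use_dictionary=["country"],
)

# The footer records the codec and encodings chosen for every column chunk.
rg = pq.ParquetFile("events_tuned.parquet").metadata.row_group(0)
for j in range(rg.num_columns):
    col = rg.column(j)
    print(col.path_in_schema, col.compression, col.encodings)
```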
💡 Key Takeaways
Parquet files consist of row groups (64 to 512 MB each), column chunks (one per column per row group), and pages (1 to 1.5 MB atomic units)
The footer stores metadata including schema, row group locations, and per-column statistics like min, max, and null count for each row group
Predicate pushdown uses row group statistics to skip entire row groups: a time-range filter can eliminate 70 to 90 percent of data before reading from disk
Each column chunk can use different encodings: dictionary encoding for low-cardinality strings, delta encoding for sorted integers, bit packing for booleans
Query engines read only the footer first (typically a few hundred KB), then selectively fetch only the column chunks and pages needed for the query
📌 Examples
1. A single day of clickstream data might be 5 to 20 TB uncompressed, written as Parquet with row groups of 256 MB. A query filtering on <code>event_time</code> reads metadata from 20,000 to 80,000 row groups in seconds, then fetches only matching chunks.
2. Netflix or Uber might store 30 days of events as Parquet files in S3. A dashboard query selecting 5 columns from 200 reads perhaps 5 percent of bytes: 30 TB instead of 600 TB logical data.
3. A <code>user_country</code> column with 50 distinct values uses dictionary encoding: store the 50 countries once, then reference by 6-bit IDs (2^6 = 64 possible values), dramatically reducing page size (see the sketch below).
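A quick, hypothetical check of that bit-width arithmetic (plain Python, no Parquet library involved):

```python
import math

distinct_countries = 50
# Smallest width w with 2**w >= 50: dictionary indices need only 6 bits (2**6 = 64).
bit_width = max(1, math.ceil(math.log2(distinct_countries)))
print(bit_width)        # 6
print(2 ** bit_width)   # 64 representable dictionary IDs
```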