
What is Parquet Format?

Definition
Parquet is a columnar, self-describing binary file format designed for analytical workloads where you need to scan massive amounts of data efficiently by reading only the columns your query needs.
The Core Problem: Imagine you have a table with 200 columns representing user events: user_id, event_time, event_type, country, and so on. You store 30 days of clickstream data, perhaps 600 TB uncompressed. Your dashboard query only needs 5 columns to count events by country and day. With row-oriented formats like CSV or JSON, every row stores all 200 columns together. To get your 5 columns, you must read every single byte of all 600 TB, deserialize entire rows, then throw away the other 195 columns. This wastes enormous amounts of disk I/O, network bandwidth, and CPU time.

How Parquet Solves This: Parquet stores each column separately within the file. Instead of storing row after row, it stores all values for event_time together, all values for country together, and so on. Your query engine reads only the 5 columns it needs, perhaps just 5 to 15 percent of the total bytes. That same 600 TB logical dataset might require reading only 30 to 90 TB from disk. The benefit compounds because columnar storage enables aggressive compression: all values in a column share a data type and often follow similar patterns. A country column with only 50 distinct values can use dictionary encoding; a timestamp column with sorted values can use delta encoding. These techniques typically achieve 3x to 10x compression ratios compared to text formats.
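To make the column-projection idea concrete, here is a minimal sketch using pyarrow (an assumption; any Parquet library behaves similarly). The table, column names, and file path are illustrative stand-ins for the 200-column clickstream example, not a real dataset.

# Minimal sketch of column projection with pyarrow (assumed installed);
# data, column names, and file path are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny "wide" table standing in for the 200-column event table.
events = pa.table({
    "user_id": [1, 2, 3],
    "event_time": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:01Z", "2024-01-01T00:00:02Z"],
    "event_type": ["click", "view", "click"],
    "country": ["US", "DE", "US"],
    "extra_col": ["...", "...", "..."],  # imagine ~195 more columns like this one
})

pq.write_table(events, "events.parquet")

# Readers fetch only the column chunks they ask for; the bytes for the
# other columns are never read from disk.
subset = pq.read_table("events.parquet", columns=["country", "event_time"])
print(subset.schema)       # schema travels inside the file (self-describing)
print(subset.num_columns)  # 2

The same projection happens transparently when a SQL engine evaluates SELECT country, event_time FROM events: only the requested column chunks are scanned.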
Storage and I/O Reduction: roughly 3x to 10x compression, with only 5 to 15% of bytes read.
Why It Matters: This combination of columnar layout and compression transforms analytical query performance. Scans that would take minutes or hours over CSV complete in tens of seconds. Storage costs drop dramatically. Query engines like Spark, Trino, and Athena can process petabyte-scale datasets on reasonably sized clusters while keeping p95 latencies under 30 seconds for complex reports.
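As a hedged illustration of how an engine exploits this, the PySpark snippet below expresses the dashboard query from the example above; the S3 path and session setup are assumptions made for the sketch, and Trino or Athena would express the same aggregation in SQL with the same column pruning.

# Hedged sketch of the dashboard query in PySpark; the path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-by-country-day").getOrCreate()

events = spark.read.parquet("s3://example-bucket/clickstream/")  # hypothetical location

# Only the `country` and `event_time` column chunks are scanned; the other
# columns in the Parquet files are skipped entirely (column pruning).
daily_counts = (
    events
    .groupBy("country", F.to_date("event_time").alias("day"))
    .count()
)
daily_counts.show()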
💡 Key Takeaways
Parquet is a columnar binary format where each column is stored separately, allowing queries to read only the columns they need instead of entire rows
Columnar layout enables aggressive compression because values in a column share the same type and often similar patterns, achieving 3x to 10x compression ratios
A query selecting 5 columns from a 200-column table reads only 5 to 15 percent of the bytes compared to row-oriented formats like CSV or JSON
The format is self-describing, meaning schema metadata is stored within the file itself, so readers know the structure without external dependencies
Parquet is optimized for Online Analytical Processing (OLAP) workloads where scans dominate and most queries touch a subset of columns across many rows
📌 Examples
1. A clickstream table with 200 columns stores 30 days of data: 600 TB uncompressed in CSV. With Parquet, storage drops to 60 to 200 TB on disk. A dashboard query needing 5 columns reads 30 to 90 TB instead of all 600 TB.
2. Query engines like Apache Spark, Trino, Presto, and AWS Athena all natively read Parquet and leverage its columnar layout for vectorized execution and predicate pushdown.
3. A Parquet file storing user events might use dictionary encoding for the country field (50 distinct values) and delta encoding for sorted event_time timestamps, dramatically reducing file size (see the sketch below).
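To make example 3 concrete, here is a minimal sketch, again assuming pyarrow, that writes a tiny table and inspects which encodings the writer actually chose. The data is made up, and the exact encodings picked for event_time depend on the writer version and settings rather than being guaranteed to be delta encoding.

# Minimal sketch: write a small events table and inspect per-column encodings.
# Data is illustrative; the chosen encodings depend on the Parquet writer.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["US", "DE", "US", "US", "DE"],     # low cardinality, dictionary-friendly
    "event_time": [1000, 1001, 1002, 1003, 1004],  # sorted values
})

pq.write_table(table, "encodings_demo.parquet", use_dictionary=["country"])

meta = pq.ParquetFile("encodings_demo.parquet").metadata
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)
    print(col.path_in_schema, col.encodings)
# Expect a dictionary encoding listed for `country`; `event_time` may use
# plain or delta encodings depending on writer settings.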