
Parquet Failure Modes and Edge Cases

The Small Files Problem: One of the most common operational failures in Parquet-based systems is the small files problem. If your ingestion pipeline writes millions of tiny Parquet files, for example 1 to 10 MB each, metadata overhead and file-listing latency dominate query performance. A query engine planning a scan must list all files, read each footer to gather statistics, and open potentially hundreds of thousands of file handles. In practice, a table with 5 million small files can take 2 to 5 minutes just for query planning, even though the actual data volume is only 50 TB. The engine spends more time listing files on S3 or GCS and reading footers than actually scanning data. At scale, this pushes p95 query latency from 20 seconds to 3 to 10 minutes, making dashboards unusable. The solution is periodic compaction: rewrite small files into larger ones, typically with row groups of 128 to 512 MB. Production systems run compaction jobs hourly or daily, merging thousands of small files into hundreds of large files (a minimal compaction sketch follows below). After compaction, query planning drops from minutes to seconds, and scan performance improves because of better parallelism and fewer file-open operations.
Query planning impact: 5M small files, 3 to 10 min; after compaction, 10 to 30 sec.
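The compaction job itself is simple; the hard parts are scheduling and sizing. Below is a minimal sketch using PyArrow's dataset API. The bucket paths and row-count caps are illustrative assumptions and would be tuned so output files land near the 128 to 512 MB target for the table's real schema and compression.

```python
# Minimal compaction sketch: read all small files under a prefix as one
# logical dataset, then rewrite them as fewer, larger files with large
# row groups. Paths and row-count caps are placeholders.
import pyarrow.dataset as ds

small_files = ds.dataset("s3://my-bucket/events/", format="parquet")

ds.write_dataset(
    small_files,
    base_dir="s3://my-bucket/events_compacted/",
    format="parquet",
    min_rows_per_group=250_000,      # accumulate batches into big row groups
    max_rows_per_group=1_000_000,
    max_rows_per_file=5_000_000,     # cap chosen so files reach the size target
    existing_data_behavior="overwrite_or_ignore",
)
```

Table formats expose the same operation as a managed command, for example Iceberg's rewrite_data_files procedure or Delta Lake's OPTIMIZE, so many teams run compaction there rather than hand-rolling it.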
Misleading Statistics and Data Skew: Parquet row group statistics enable predicate pushdown, but they fail when data is heavily skewed or statistics are incorrect. Imagine a table partitioned by date where one partition accidentally contains events from multiple dates because of a bug. The row group min and max for event_time span a wide range, making the statistics useless for filtering. A query that should skip this partition now has to scan it fully, turning a 5-second query into a 2-minute scan. Similarly, if 90 percent of matching rows sit in a single row group but the remaining 10 percent are smeared thinly across every other row group because of poor sorting or partitioning, predicate pushdown does not help: nearly every row group's min and max overlap the filter, so the query still reads most of the data. This is why data layout matters as much as file format. Production systems carefully design partitioning schemes (by date, region, or customer ID) and sort data within partitions to align with common query filters. Without this, Parquet's metadata becomes less effective.
⚠️ Common Pitfall: Writing Parquet files without sorting or partitioning aligned to query patterns results in statistics that cannot eliminate row groups, forcing full scans even when filters should be highly selective.
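A quick way to check whether your statistics can actually prune anything is to read them straight out of the footers. Below is a small PyArrow sketch, assuming a flat schema and a filter column named event_time; the file path is a placeholder.

```python
# Print per-row-group min/max for the filter column. If the ranges are wide
# or nearly identical across row groups, predicate pushdown cannot skip them.
import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.parquet")              # placeholder path
col = pf.schema_arrow.get_field_index("event_time")    # assumes a flat schema

for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(col).statistics
    if stats is None or not stats.has_min_max:
        print(f"row group {rg}: no usable statistics")
    else:
        print(f"row group {rg}: min={stats.min} max={stats.max}")
```

If every row group reports roughly the same wide range, sorting on the filter column before writing (or repartitioning so each file covers a narrow range) is what restores pruning.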
Schema Evolution and Compatibility: Parquet files are self-describing, but schema evolution across thousands of files in the same table is a common source of pain. Reordering columns, changing a field from optional to required, or narrowing data types (for example, changing int32 to int16) can make old files unreadable or cause subtle data corruption. At petabyte scale with millions of files, these issues show up in specific partitions and are hard to detect. A query might succeed on 99 percent of the data but fail on one partition written two years ago with a slightly different schema. Strong schema enforcement at write time and validation jobs that periodically check schema consistency are critical. Table formats like Iceberg and Delta Lake help by maintaining a single source-of-truth schema in the transaction log, but teams must still be careful when evolving schemas to maintain backward compatibility.

Nested Data and Memory Pressure: Parquet handles nested data (structs, lists, maps) using the Dremel model with repetition and definition levels. This works well for moderate nesting, but deeply nested structures or highly variable-length arrays can blow up memory usage. A column with lists of lists of strings, where some rows have 10 elements and others have 10,000, creates huge definition-level arrays and large dictionaries. Readers may run out of memory when decoding these columns, causing executor failures in Spark or Presto. The symptom is sporadic out-of-memory errors on certain partitions while others succeed. The fix is to flatten the schema, split deeply nested columns into separate tables, or give each executor more memory. This is an edge case but appears frequently in event schemas with arbitrary JSON payloads flattened into Parquet.

Metadata Scaling: At extreme scale, even reading Parquet footers becomes a bottleneck. A table with 100,000 files means 100,000 footer reads. At 10 ms per S3 GET request, that is 1,000 seconds (over 16 minutes) of serial metadata fetching. Query engines mitigate this with parallelism and caching, but if metadata is not cached or the cache is cold, planning latency spikes. Systems like Iceberg solve this by maintaining summary metadata (manifest files) that aggregates statistics across many Parquet files, reducing the number of footers that must be read. Without such a layer, metadata overhead limits scalability to tens of thousands of files per table.
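A lightweight guard against both the schema drift and the footer-fetch problems is a periodic footer-only validation job: read each file's schema, compare it against a reference, and fan the reads out across threads, since each footer is a small, latency-bound request. The sketch below uses PyArrow; the directory glob, worker count, and choice of reference schema are assumptions.

```python
# Footer-only schema consistency check. No data pages are read, so the cost
# is one small request per file, which is exactly why it is parallelized.
import glob
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

files = sorted(glob.glob("events/**/*.parquet", recursive=True))  # placeholder location
reference = pq.read_schema(files[0])        # or the table's declared schema

def check(path):
    schema = pq.read_schema(path)           # reads only the footer
    return path, schema.equals(reference)

with ThreadPoolExecutor(max_workers=32) as pool:
    for path, matches in pool.map(check, files):
        if not matches:
            print(f"schema drift detected: {path}")
```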
💡 Key Takeaways
The small files problem occurs when millions of 1 to 10 MB Parquet files cause query planning to take 2 to 5 minutes just listing files and reading footers, increasing p95 latency from 20 seconds to 3 to 10 minutes
Periodic compaction that rewrites small files into larger ones (128 to 512 MB row groups) reduces query planning time from minutes to seconds and improves scan parallelism
Misleading or incorrect row group statistics caused by data skew, poor sorting, or partitioning bugs make predicate pushdown ineffective, forcing full scans even when filters should be selective
Schema evolution issues like reordering fields, changing optionality, or narrowing data types can make old Parquet files unreadable or cause subtle data corruption across millions of files at petabyte scale
Nested data with highly variable lengths (lists with 10 to 10,000 elements) can blow up memory usage in readers, causing out of memory errors on specific partitions while others succeed
📌 Examples
1. A streaming pipeline writes 1 MB Parquet files every 10 seconds, creating 8,640 files per day. After 90 days, the table has 777,600 files. Query planning takes 4 minutes. After compaction to 256 MB files, planning drops to 15 seconds.
2. A table partitioned by date accidentally writes events from multiple dates into one partition because of a timestamp parsing bug. The row group event_time min and max span 30 days, making date filters useless. A query filtering to one day scans the entire partition.
3. A company evolves their schema to add a new required field. Queries fail on partitions written before the change with 'missing required field' errors. They must either rewrite old files or make the field optional with a default value.
4. An event schema includes a metadata column with nested JSON stored as a Parquet MAP of LIST of STRING. Some events have 5 keys, others have 5,000. Spark executors run out of memory decoding this column, failing on 2% of partitions.
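For cases like example 4, the usual flattening fix is to explode the nested column into its own narrow table keyed by an event ID, so wide scans of the event table never have to decode it. Below is a hedged PySpark sketch; the paths, the event_id key, and the metadata column type are assumptions about the schema.

```python
# Split a metadata MAP<STRING, LIST<STRING>> column out of the event table
# into a narrow key/value table. Paths and column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten-metadata").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")

metadata_kv = (
    events
    .select("event_id", F.explode("metadata").alias("key", "values"))  # map -> one row per key
    .select("event_id", "key", F.explode("values").alias("value"))     # list -> one row per element
)

metadata_kv.write.mode("overwrite").parquet("s3://my-bucket/event_metadata/")
events.drop("metadata").write.mode("overwrite").parquet("s3://my-bucket/events_flat/")
```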