Big Data Systems › Columnar Storage & Compression · Medium · ⏱️ ~3 min

How Do Statistics, Zone Maps, and Predicate Pushdown Enable Data Skipping at Scale?

Columnar formats maintain rich metadata at multiple granularities: per page (64 kilobytes to 1 megabyte), per segment or row group (64 to 512 megabytes), and per partition. For each column chunk, systems store minimum and maximum values, null counts, distinct value estimates, and optionally bloom filters. This metadata enables predicate pushdown: query engines evaluate filter conditions against statistics before reading any data pages, skipping entire segments or pages that provably contain no matching rows.

Consider a query filtering on timestamp greater than yesterday. The engine reads partition metadata, identifies that partitions older than last week have a maximum timestamp below the threshold, and skips terabytes of data without a single disk read. Within the relevant partitions, it checks segment-level zone maps (min/max per segment) and skips segments outside the time range. Finally, page-level statistics prune individual pages. Amazon Redshift users report that when sort and distribution keys align with query predicates, billions of rows scan in seconds because 80% to 95% of the data never leaves storage.

Bloom filters add probabilistic skipping for highly selective predicates on columns without natural ordering. A bloom filter on a user identifier can determine with certainty that a segment does NOT contain a specific userId, so those segments are skipped without being read. However, bloom filters have tunable false positive rates (typically 1% to 5%); when sized too small for the column's cardinality or otherwise poorly configured, they waste I/O on segments that ultimately yield no rows. Google BigQuery reports that Capacitor's page-level min/max statistics, combined with column pruning, allow queries over 10 to 100 terabytes to complete in seconds when predicates are selective.

The effectiveness of data skipping hinges on data layout. If a table is unsorted or partitioned on a dimension uncorrelated with query filters, zone maps cannot prune effectively: a table sorted by timestamp but queried by userId will scan nearly all pages despite having excellent metadata. Production systems invest heavily in choosing sort keys and partition schemes that align with dominant query patterns, often maintaining multiple clustered copies or materialized views for different access patterns.
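As a concrete illustration, the sketch below performs zone-map style row-group pruning against Parquet footer statistics with pyarrow. The file name events.parquet and the columns event_ts_ms (int64 epoch milliseconds) and user_id are hypothetical, and the code assumes a flat schema so that Arrow field order matches Parquet column order; engines with predicate pushdown make the same skipping decision internally.

```python
# Minimal sketch of zone-map style pruning: decide which Parquet row groups
# to read purely from their min/max statistics. File and column names are
# hypothetical; assumes a flat schema (Arrow field order == Parquet column order).
import pyarrow.parquet as pq

def row_groups_matching(path: str, column: str, lower_bound) -> list[int]:
    """Return indices of row groups that could contain rows with column >= lower_bound."""
    pf = pq.ParquetFile(path)
    col_idx = pf.schema_arrow.get_field_index(column)
    keep = []
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            keep.append(rg)          # no statistics: cannot prove irrelevance, must read
        elif stats.max >= lower_bound:
            keep.append(rg)          # zone map says some row *could* match
        # else: provably no row satisfies the predicate -> skip without reading pages
    return keep

pf = pq.ParquetFile("events.parquet")
matching = row_groups_matching("events.parquet", "event_ts_ms", 1_700_000_000_000)
print(f"reading {len(matching)} of {pf.metadata.num_row_groups} row groups")
if matching:
    table = pf.read_row_groups(matching, columns=["user_id", "event_ts_ms"])
```

For the skipped row groups, the only bytes touched are the footer metadata, which is exactly the effect zone maps are designed to achieve.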
💡 Key Takeaways
Systems store min/max values, null counts, distinct estimates, and bloom filters at page (64 KB to 1 MB), segment (64 to 512 MB), and partition granularity for predicate evaluation
Predicate pushdown evaluates filters against metadata before reading data, skipping entire partitions, segments, or pages that provably contain no matching rows
Amazon Redshift scans billions of rows in seconds when sort/distribution keys align with filters, achieving 80% to 95% data skipping via zone map pruning
Bloom filters enable probabilistic skipping for highly selective predicates but have tunable false positive rates (1% to 5%); poor sizing or configuration wastes I/O on segments that contain no matching rows (see the sketch after this list)
Data layout is critical: tables sorted by timestamp but queried by userId see minimal pruning; production systems choose sort keys and partitions aligned with dominant query patterns
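To make the false positive trade-off concrete, here is a toy, pure-Python Bloom filter sketch, not any engine's actual implementation; the sizing formulas are the standard ones, and all names and parameters are illustrative. A negative lookup is a proof of absence (the segment can be skipped), while undersizing the bit array raises the false positive rate and causes wasted reads.

```python
# Toy Bloom filter: proves absence with certainty, admits tunable false positives.
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Standard sizing: m bits and k hash functions for a target false positive rate.
        self.m = max(8, int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False => definitely absent (safe to skip the segment).
        # True  => maybe present (must read the segment; may be a false positive).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# One filter per segment: a negative answer lets the engine skip that segment's pages.
segment_filter = BloomFilter(expected_items=100_000, fp_rate=0.01)
segment_filter.add("user_42")
print(segment_filter.might_contain("user_42"))      # True
print(segment_filter.might_contain("user_99999"))   # almost always False -> skip segment
```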
📌 Examples
Google BigQuery Capacitor: page-level min/max plus column pruning allows queries over 10 to 100 terabytes to complete in seconds when predicates aggressively prune segments
Meta Presto on ORC: stripe-level statistics and bloom filters yield over 90% stripe pruning on selective predicates, converting multi-minute scans into seconds on shared clusters
AWS Athena scanning Parquet on S3: partition pruning plus column projection reduces scanned bytes by 80% to 90% compared to full table scans, directly cutting query cost and latency
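The partition pruning plus column projection pattern in the Athena example can be sketched locally with pyarrow's dataset API; the directory layout events/dt=.../part-*.parquet, the partition key dt, and the column names are assumptions for illustration, not Athena's implementation.

```python
# Sketch of partition pruning + column projection with pyarrow datasets.
# Assumes a Hive-partitioned layout such as events/dt=2024-01-01/part-0.parquet;
# the path, partition key, and column names are illustrative.
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition key explicitly as a string so the comparison below is unambiguous.
part = ds.partitioning(pa.schema([("dt", pa.string())]), flavor="hive")
dataset = ds.dataset("events/", format="parquet", partitioning=part)

table = dataset.to_table(
    columns=["user_id", "amount"],            # column projection: decode only these columns
    filter=ds.field("dt") >= "2024-01-01",    # partition pruning: skip older directories entirely
)
print(table.num_rows)
```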