Search & Ranking Systems › Query Parsing & Optimization · Medium · ⏱️ ~2 min

Predicate Pushdown and Projection Pruning: Minimizing Data Movement

Predicate pushdown and projection pruning are foundational query-rewrite optimizations that minimize Input/Output (I/O) and network transfer by filtering and projecting data as early as possible in the execution pipeline. Predicate pushdown moves filter conditions closer to the data source, allowing storage engines or remote systems to evaluate predicates and return only matching rows. Projection pruning eliminates unreferenced columns, reducing the bytes read and transmitted.

Amazon Aurora leverages parallel query to push predicates into the storage layer, filtering at storage nodes before sending data to compute. Internal benchmarks report 2 to 3x speedups on star-schema scans by avoiding transfer of filtered-out blocks, turning multi-second scans into sub-second responses. Google BigQuery pushes predicates to columnar storage (the Capacitor format), scanning only required columns and evaluating filters in place. With well-partitioned and clustered data, predicate pushdown combined with partition pruning routinely reduces scanned bytes by over 90%, directly lowering both query latency (from minutes to seconds) and cost (per-byte-scanned pricing).

The failure mode is non-sargable predicates: wrapping indexed or partitioned columns in functions or implicit type casts prevents pushdown. For example, WHERE YEAR(order_date) = 2024 cannot push the filter to a date-partitioned table, forcing a full scan. Rewriting it as WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31' enables partition pruning and columnar filter pushdown, often reducing scanned data by 10 to 100x.
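The non-sargable vs. sargable contrast shows up directly in a query planner. Below is a minimal sketch using Python's built-in sqlite3 module, with SQLite standing in for a partitioned warehouse (the orders table and idx_order_date index are invented for illustration): the function-wrapped predicate forces a full scan, while the bare range lets the planner use the index.

```python
import sqlite3

# SQLite as a stand-in for partition/index pruning; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_date TEXT)")
conn.execute("CREATE INDEX idx_order_date ON orders(order_date)")

def plan(query):
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(r[-1] for r in rows)  # last column is the plan detail

# Non-sargable: a function wraps the indexed column, so the index is unusable.
p1 = plan("SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'")

# Sargable rewrite: a bare range on the column lets the planner use the index.
p2 = plan("SELECT * FROM orders "
          "WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'")

print(p1)  # e.g. "SCAN orders" -- full table scan
print(p2)  # e.g. "SEARCH orders USING INDEX idx_order_date ..."
```

The same rewrite principle applies to warehouse partition pruning: the planner can only prune when the partitioning column appears bare on one side of a comparison.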
💡 Key Takeaways
Amazon Aurora parallel query pushes predicates to storage nodes, achieving 2 to 3x speedups on star-schema scans by filtering blocks before transfer, turning multi-second scans into sub-second queries
Google BigQuery predicate pushdown and partition pruning reduce scanned bytes by over 90%, directly lowering latency (minutes to seconds) and cost (pay-per-byte-scanned model)
Projection pruning reads only required columns in columnar formats; scanning 3 of 50 columns reduces I/O by approximately 15x compared to row-oriented full-table scans
Non-sargable predicates (functions on columns, implicit casts) block pushdown and force full scans: WHERE YEAR(date) = 2024 scans the entire table, while the WHERE date BETWEEN rewrite enables partition pruning
Trade-off: pushdown works best with columnar storage and partitioning; row stores benefit less. For key-value access patterns, prefer NoSQL over relying on relational pushdown
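The 3-of-50-columns arithmetic above can be sketched with a toy columnar layout (column names, row count, and the cell-based I/O accounting are all invented for illustration):

```python
# Toy columnar layout: each column is stored separately, so a query touching
# 3 of 50 columns reads ~1/17 of the cells a row store would read.
table = {f"c{i}": list(range(1_000)) for i in range(50)}  # 50 cols x 1,000 rows

cells_read = 0

def project(col_names):
    """Materialize only the named columns, counting cells actually touched."""
    global cells_read
    out = {}
    for name in col_names:
        out[name] = table[name]
        cells_read += len(table[name])
    return out

result = project(["c0", "c1", "c2"])      # query references 3 of 50 columns
full_scan_cells = 50 * 1_000              # a row store must read every cell
print(full_scan_cells / cells_read)       # ~16.7x less data read
```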
📌 Examples
BigQuery partition pruning: a query on a daily-partitioned table with WHERE date = '2024-06-15' scans 1 partition (1 day) instead of 365, reducing the scan from 500 GB to 1.4 GB and the cost from dollars to cents.
Amazon Athena on Parquet: SELECT user_id, revenue FROM sales WHERE country = 'US' reads only the user_id, revenue, and country columns and evaluates the country filter in the Parquet reader, scanning 10 GB instead of 100 GB.
Anti-pattern: WHERE DATE_TRUNC('month', order_date) = '2024-06' prevents partition/index usage. Rewrite as WHERE order_date >= '2024-06-01' AND order_date < '2024-07-01' to enable pruning and drop the scan from 1 terabyte (TB) to 30 gigabytes (GB).
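In the spirit of the Athena example, a toy reader can show both optimizations working together (schema and data are invented): the predicate is evaluated on the filter column inside the "reader", and only the projected columns are materialized for the matching rows.

```python
# Toy columnar reader combining filter pushdown and projection pruning.
cols = {
    "user_id": [1, 2, 3, 4],
    "revenue": [10.0, 20.0, 5.0, 7.5],
    "country": ["US", "DE", "US", "FR"],
}

def read(projection, filter_col, filter_val):
    # 1. Pushdown: scan only the filter column to find matching positions.
    hits = [i for i, v in enumerate(cols[filter_col]) if v == filter_val]
    # 2. Projection pruning: materialize only the requested columns,
    #    and only for the matching row positions.
    return [{c: cols[c][i] for c in projection} for i in hits]

print(read(["user_id", "revenue"], "country", "US"))
# → [{'user_id': 1, 'revenue': 10.0}, {'user_id': 3, 'revenue': 5.0}]
```

A real Parquet reader does the same thing per row group, additionally using column statistics (min/max) to skip whole chunks before decoding anything.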