Predicate Pushdown and Projection Pruning: Minimizing Data Movement
Predicate Pushdown
Predicate pushdown moves filter conditions closer to the data source. Consider: SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE c.created_at > '2024-01-01'. Without pushdown, we read all 10 million orders, join to 1 million customers, then filter. With pushdown, we filter customers first (1M to 200K rows), then join. The optimizer rewrites this as: SELECT o.id, c.name FROM orders o JOIN (SELECT * FROM customers WHERE created_at > '2024-01-01') c ON o.customer_id = c.id. Result: 80% fewer rows processed in the join.
Projection Pruning
Projection pruning eliminates unreferenced columns. For SELECT customer_id, total FROM orders on a table with 50 columns averaging 100 bytes each (5KB per row), we only need 2 columns (~20 bytes). In row storage, all columns still read from disk but fewer bytes transfer over network. In columnar storage like Parquet, only the customer_id and total column files are read: 96% less I/O. A 1 terabyte table with 50 columns queried for 2 columns reads only 40 gigabytes from disk.
Pushdown Across Systems
Federated queries push predicates to remote sources. For SELECT * FROM remote_db.orders WHERE created_at > '2024-01-01' AND status = 'pending', pushing both predicates to the remote database filters there, transferring only matching rows. Without pushdown: transfer 10 million rows, then filter locally. With pushdown: remote returns 50,000 matching rows. Network transfer drops from 5 gigabytes to 25 megabytes. Object storage queries similarly benefit: scanning 1TB of Parquet with WHERE region = 'US' and projection of 3 columns might read only 10 gigabytes when pushdown and partition pruning combine.
Partition Pruning
Partition pruning eliminates entire data partitions. For a table partitioned by date: SELECT * FROM events WHERE event_date = '2024-06-15' on 365 daily partitions containing 1TB total reads only that day's partition (2.7 gigabytes). But SELECT * FROM events WHERE event_timestamp BETWEEN '2024-06-15 23:00' AND '2024-06-16 01:00' spans two partitions because the predicate crosses midnight. Partition keys should align with common query patterns.