
Statistics and Cardinality Estimation: Foundation of Cost Modeling

Accurate cardinality estimation is the cornerstone of cost-based optimization. The optimizer relies on statistics to predict how many rows each operation will produce, which determines whether to use an index, which join algorithm to choose, and whether to broadcast or repartition data. Statistics include table and partition row counts, the Number of Distinct Values (NDV) for each column, histograms of value distributions, and multi-column correlation data for common predicate combinations.

Data skew and correlations are the primary causes of misestimation. If a histogram suggests a uniform distribution but the actual data is heavily skewed (for example, 90% of rows have status = 'active'), the optimizer may choose a nested loop join expecting a few rows when a hash join would be better for millions. Stale statistics compound the problem: if a table has grown 10x since the last statistics update, the optimizer underestimates costs and selects plans that spill to disk or trigger Out Of Memory (OOM) errors. Amazon Redshift and Google BigQuery rely on automated statistics collection and incremental updates on hot partitions to keep estimates fresh.

Misestimations of 10 to 1000x are common and catastrophic. A query estimated to return 1,000 rows that actually returns 1 million will choose an index nested loop join, performing 1 million index probes instead of a single hash join. Symptoms include memory spills, massive shuffles in distributed systems, and P99 latency spikes. Google BigQuery aggressively prunes partitions and pushes predicates down to reduce scanned bytes by over 90%, directly mitigating estimation errors by narrowing the data domain.
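To make the selectivity arithmetic concrete, here is a minimal Python sketch contrasting the uniform 1/NDV estimate with the true frequency of a skewed value. The status counts, row totals, and nested-loop threshold are illustrative assumptions, not figures from any specific engine.

```python
# Minimal sketch: uniform (1/NDV) selectivity estimate vs. the true
# frequency under skew, and how the gap flips the join choice.
# Counts and the nested-loop threshold are illustrative assumptions.

value_counts = {'active': 900_000}          # heavy skew toward one status
for s in ['pending', 'shipped', 'delivered', 'cancelled', 'returned',
          'refunded', 'on_hold', 'failed', 'archived']:
    value_counts[s] = 11_111                # nine minor statuses share ~10%

total_rows = sum(value_counts.values())     # ~1,000,000 rows
ndv = len(value_counts)                     # 10 distinct values

est_sel = 1.0 / ndv                               # uniform assumption: 0.10
real_sel = value_counts['active'] / total_rows    # actual frequency: ~0.90

est_rows = int(total_rows * est_sel)
real_rows = value_counts['active']
print(f"estimated {est_rows:,} rows, actual {real_rows:,} rows "
      f"({real_rows / est_rows:.0f}x underestimate)")

# Toy plan rule: an index nested loop only pays off for a small driving side;
# beyond the threshold a hash join (or full scan) is cheaper.
NESTED_LOOP_ROW_LIMIT = 200_000
plan = lambda rows: "index nested loop" if rows < NESTED_LOOP_ROW_LIMIT else "hash join"
print(f"plan from estimate: {plan(est_rows)}, plan the data needs: {plan(real_rows)}")
```

Under these assumptions the estimate comes in roughly 9x low, and that single error is enough to flip the plan from a hash join to a million index probes.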
💡 Key Takeaways
Histograms, Number of Distinct Values (NDV) sketches, and multi-column correlation stats drive selectivity estimates. Misestimates of 10 to 1000x cause plan disasters such as choosing a nested loop join over a hash join
Stale statistics are a critical failure mode: a table grows 10x but the optimizer uses old row counts, selecting plans that spill to disk or trigger Out Of Memory (OOM) errors on workers
Data skew breaks the uniformity assumption in histograms. If 90% of rows share one value, the optimizer underestimates the result size and chooses the wrong join strategy
Google BigQuery partition pruning and predicate pushdown reduce scanned bytes by over 90%, narrowing the estimation domain and directly lowering both cost and latency risk
Trade-off: frequent statistics updates improve accuracy but consume resources and invalidate cached plans. Balance update cadence against data volatility and query sensitivity (see the drift-check sketch after this list)
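The stale-statistics and update-cadence points above can be reduced to a simple drift check. The sketch below uses hypothetical thresholds and field names; production engines (Redshift's automatic ANALYZE, for example) apply their own internal heuristics.

```python
# Hypothetical heuristic for deciding when a table's statistics are stale
# enough to justify a refresh. Thresholds and field names are illustrative.

from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    rows_at_last_analyze: int   # row count recorded when stats were gathered
    current_rows: int           # live row count from table metadata
    hot: bool                   # frequently queried / high insert rate

def needs_refresh(stats: TableStats,
                  hot_drift: float = 0.10,          # refresh hot tables at 10% growth
                  cold_drift: float = 0.50) -> bool:  # tolerate 50% on cold ones
    """Refresh when row-count drift since the last ANALYZE exceeds the
    tolerance for the table's temperature; hot tables get a tighter bound
    so the optimizer never plans against badly stale counts for long."""
    if stats.rows_at_last_analyze == 0:
        return True
    drift = abs(stats.current_rows - stats.rows_at_last_analyze) / stats.rows_at_last_analyze
    return drift > (hot_drift if stats.hot else cold_drift)

tables = [
    TableStats("orders",        rows_at_last_analyze=10_000_000, current_rows=13_000_000, hot=True),
    TableStats("orders_2019",   rows_at_last_analyze=5_000_000,  current_rows=5_000_100,  hot=False),
    TableStats("events_stream", rows_at_last_analyze=1_000_000,  current_rows=11_000_000, hot=True),
]
for t in tables:
    print(f"{t.name}: refresh={needs_refresh(t)}")
```

The third table models the 10x-growth case from the takeaways: without a refresh, every plan for it is costed against a row count that is an order of magnitude too small.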
📌 Examples
Skew example: an Orders table with a status column where 90% of rows are active and 10% are everything else. The histogram assumes a uniform distribution, so WHERE status = 'active' is estimated at 10% selectivity when the true figure is 90%, causing the plan to choose an index scan (slow) over a full scan (fast).
Amazon Redshift incremental statistics: hot partitions are updated every few hours, cold partitions weekly. Tables with high insert rates get a daily stats refresh to avoid 10x row-count drift.
Google BigQuery partition pruning: a query with WHERE date BETWEEN '2024-01-01' AND '2024-01-07' on a daily partitioned table scans 7 partitions instead of 365, reducing bytes scanned from 500 gigabytes (GB) to 10 GB and runtime from 30 seconds to under 2 seconds (see the pruning sketch below).
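The pruning arithmetic in the last example can be checked with a short back-of-the-envelope sketch; the partition sizes and date range are assumed to match the figures above rather than taken from BigQuery itself.

```python
# Back-of-the-envelope partition pruning: keep only the daily partitions
# that overlap the WHERE date range and compare bytes scanned.
# Partition count and sizes are assumed for illustration (~500 GB total).

from datetime import date, timedelta

all_partitions = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]
bytes_per_partition = 500 * 10**9 // len(all_partitions)   # ~1.4 GB per day

# Predicate: WHERE date BETWEEN '2024-01-01' AND '2024-01-07'
lo, hi = date(2024, 1, 1), date(2024, 1, 7)
surviving = [p for p in all_partitions if lo <= p <= hi]

scanned_before = len(all_partitions) * bytes_per_partition
scanned_after = len(surviving) * bytes_per_partition
print(f"partitions scanned: {len(all_partitions)} -> {len(surviving)}")
print(f"bytes scanned: {scanned_before / 1e9:.0f} GB -> {scanned_after / 1e9:.1f} GB "
      f"({100 * (1 - scanned_after / scanned_before):.0f}% reduction)")
```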