Statistics and Cardinality Estimation: Foundation of Cost Modeling
Statistics Foundation
Accurate cardinality estimation is the cornerstone of cost-based optimization. For SELECT * FROM orders WHERE status = 'pending', the optimizer needs to estimate how many rows match. Statistics include table row counts (orders has 10 million rows), column distinct-value counts (status has 5 distinct values), and value distributions. If status values are uniformly distributed, the optimizer estimates 2 million rows (10M / 5). If instead 80% of orders are 'completed' and only 1% are 'pending', the estimate should be 100,000 rows (10M × 1%). Wrong estimates lead to wrong plans.
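The two estimates above can be sketched as a pair of toy functions. This is a minimal illustration, not any real engine's API; the function names and the frequency table are assumptions, while the numbers (10M rows, 5 distinct values, 80%/1% skew) come from the example.

```python
# Toy cardinality estimates from table statistics (illustrative names).

def estimate_equality(row_count, n_distinct):
    """Uniform-distribution estimate for col = const: rows / NDV."""
    return row_count / n_distinct

def estimate_with_frequencies(row_count, value_fractions, value):
    """Estimate using per-value frequency statistics (e.g. a most-common-values list)."""
    return row_count * value_fractions[value]

rows = 10_000_000
print(estimate_equality(rows, 5))  # uniform assumption: 2,000,000

fractions = {"completed": 0.80, "pending": 0.01}  # assumed skew from the text
print(estimate_with_frequencies(rows, fractions, "pending"))  # 100,000
```

The gap between the two numbers is exactly what frequency statistics buy the optimizer: the uniform estimate is off by 20x for 'pending'.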
Histogram Types
Histograms capture value distributions for skewed data. Consider SELECT * FROM user_events WHERE user_id = 12345. If 1% of users generate 50% of events, the uniform-distribution assumption fails. Equi-width histograms divide the value range into equal-sized buckets: user_ids 1 to 10000 in bucket 1, 10001 to 20000 in bucket 2, and so on. Equi-depth histograms divide the data into buckets holding equal row counts: bucket 1 might contain user_ids 1 to 100 (all power users), while bucket 2 spans 101 to 50000 (casual users). Equi-depth handles skew better because bucket boundaries adapt to the data. Most databases store 100 to 200 histogram buckets per column.
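The difference is easiest to see by building both histogram types over the same skewed sample. This is a sketch under assumed data (one power user generating half of all events); the helper names are illustrative.

```python
# Equi-width vs equi-depth histograms over a skewed user_id sample.

def equi_width(values, n_buckets):
    """Row count per equal-width value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return counts

def equi_depth(values, n_buckets):
    """Upper boundary of each bucket, chosen so buckets hold ~equal row counts."""
    s = sorted(values)
    size = len(s) // n_buckets
    return [s[min((i + 1) * size, len(s)) - 1] for i in range(n_buckets)]

# Skewed data: user 1 generates half of the 1,000 events.
events = [1] * 500 + list(range(2, 502))
print(equi_width(events, 4))   # rows pile into the first bucket
print(equi_depth(events, 4))   # boundaries adapt: early buckets stay near user 1
```

In the equi-width histogram most rows land in one bucket, so every estimate inside that bucket is badly averaged; the equi-depth boundaries cluster around the power user, keeping per-bucket estimates tight.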
Stale Statistics
Statistics become stale as data changes. If orders doubled from 5 million to 10 million rows since the last ANALYZE, the optimizer underestimates join sizes by 2x and may pick a nested loop join when a hash join would be better. Correlated predicates cause a different class of error. For SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id WHERE c.region = 'EU' AND o.currency = 'EUR', if 90% of rows satisfy each predicate and the predicates overlap almost completely (EU customers overwhelmingly pay in EUR), the true combined selectivity is close to 90%, but treating the predicates as independent multiplies their selectivities and estimates 81% (0.9 × 0.9). Modern databases auto-collect statistics after roughly 10% of a table's rows change.
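The independence-assumption error can be shown with a few lines of arithmetic. The 0.9 selectivities mirror the EU/EUR example above; the row count and variable names are assumptions for illustration.

```python
# Independence assumption vs correlated predicates (EU/EUR example).

orders = 1_000_000          # assumed table size
sel_eu = 0.9                # fraction of rows matching region = 'EU'
sel_eur = 0.9               # fraction of rows matching currency = 'EUR'

# Optimizer's default: multiply selectivities as if independent.
independent_estimate = orders * sel_eu * sel_eur   # ~810,000 rows (0.81)

# Reality in this example: the predicates overlap almost completely,
# so the combined selectivity stays near 0.9.
true_estimate = orders * 0.9                       # ~900,000 rows

print(round(independent_estimate), round(true_estimate))
```

The 81% vs 90% gap looks small here, but the same multiplication applied to highly selective correlated predicates (say two 1% predicates that always co-occur) underestimates by 100x.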
Compounding Estimation Errors
Cardinality errors compound multiplicatively through query plans. For a 4-table join like SELECT * FROM orders o JOIN customers c ON ... JOIN products p ON ... JOIN categories cat ON ..., if each input's estimate is off by 2x, the final estimate is off by 16x (2^4). A 16x underestimate causes memory spills when hash tables exceed their allocated memory. A 16x overestimate wastes memory and over-provisions parallelism. Production systems commonly see 10x to 1000x errors for complex queries with correlated predicates.
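The multiplicative growth is worth seeing numerically. A minimal sketch, assuming a uniform per-input error factor; the helper name is hypothetical.

```python
# Multiplicative compounding of per-input cardinality errors through a plan.

def compounded_error(per_input_error: float, n_inputs: int) -> float:
    """Overall estimation error when each of n_inputs is off by per_input_error."""
    return per_input_error ** n_inputs

# Four inputs each misestimated by 2x -> final estimate off by 16x.
print(compounded_error(2, 4))

# Deeper plans blow up fast, consistent with 10x-1000x errors in practice.
for n in (7, 10):
    print(n, compounded_error(2, n))
```

Even modest per-input errors reach the 1000x range by ten inputs, which is why memory grants sized from these estimates so often spill or over-reserve.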