
Statistics and Cardinality Estimation: Foundation of Cost Modeling

Accurate cardinality estimation is the cornerstone of cost-based optimization. The optimizer relies on statistics to predict how many rows each operation will produce, which determines whether to use an index, which join algorithm to choose, and whether to broadcast or repartition data. Statistics include table and partition row counts, the Number of Distinct Values (NDV) for each column, histograms of value distributions, and multi-column correlation data for common predicate combinations.

Data skew and correlations are the primary causes of misestimation. If a histogram suggests a uniform distribution but the actual data is heavily skewed (for example, 90% of rows have status = 'active'), the optimizer may choose a nested loop join expecting a few rows when a hash join would be better for millions. Stale statistics compound the problem: if a table has grown 10x since the last statistics update, the optimizer underestimates costs and selects plans that spill to disk or trigger Out Of Memory (OOM) errors. Amazon Redshift and Google BigQuery rely on automated statistics collection and incremental updates on hot partitions to keep estimates fresh.

Misestimations of 10 to 1000x are common and catastrophic. A query estimated to return 1,000 rows that actually returns 1 million will choose an index nested loop join, performing 1 million index probes instead of a single hash join. Symptoms include memory spills, massive shuffles in distributed systems, and P99 latency spikes. Google BigQuery aggressively prunes partitions and pushes predicates down to reduce scanned bytes by over 90%, directly mitigating estimation errors by narrowing the data domain.
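To make the selectivity arithmetic concrete, here is a minimal Python sketch contrasting the uniform 1/NDV estimate with the true frequency of a skewed value. The status counts, row totals, and nested-loop threshold are illustrative assumptions, not figures from any specific engine.

```python
# Minimal sketch: uniform (1/NDV) selectivity estimate vs. the true
# frequency under skew, and how the gap flips the join choice.
# Counts and the nested-loop threshold are illustrative assumptions.

value_counts = {'active': 900_000}          # heavy skew toward one status
for s in ['pending', 'shipped', 'delivered', 'cancelled', 'returned',
          'refunded', 'on_hold', 'failed', 'archived']:
    value_counts[s] = 11_111                # nine minor statuses share ~10%

total_rows = sum(value_counts.values())     # ~1,000,000 rows
ndv = len(value_counts)                     # 10 distinct values

est_sel = 1.0 / ndv                               # uniform assumption: 0.10
real_sel = value_counts['active'] / total_rows    # actual frequency: ~0.90

est_rows = int(total_rows * est_sel)
real_rows = value_counts['active']
print(f"estimated {est_rows:,} rows, actual {real_rows:,} rows "
      f"({real_rows / est_rows:.0f}x underestimate)")

# Toy plan rule: an index nested loop only pays off for a small driving side;
# beyond the threshold a hash join (or full scan) is cheaper.
NESTED_LOOP_ROW_LIMIT = 200_000
plan = lambda rows: "index nested loop" if rows < NESTED_LOOP_ROW_LIMIT else "hash join"
print(f"plan from estimate: {plan(est_rows)}, plan the data needs: {plan(real_rows)}")
```

Under these assumptions the estimate comes in roughly 9x low, and that single error is enough to flip the plan from a hash join to a million index probes.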
💡 Key Takeaways
Histograms, Number of Distinct Values (NDV) sketches, and multi-column correlation stats drive selectivity estimates. Misestimates of 10 to 1000x cause plan disasters such as choosing a nested loop join over a hash join
Stale statistics are a critical failure mode: a table grows 10x but the optimizer uses old row counts, selecting plans that spill to disk or trigger Out Of Memory (OOM) errors on workers
Data skew breaks the uniformity assumption in histograms. If 90% of rows share one value, the optimizer underestimates the result size and chooses the wrong join strategy
Google BigQuery partition pruning and predicate pushdown reduce scanned bytes by over 90%, narrowing the estimation domain and directly lowering both cost and latency risk
Trade-off: frequent statistics updates improve accuracy but consume resources and invalidate cached plans. Balance update cadence against data volatility and query sensitivity (see the drift-check sketch after this list)
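The stale-statistics and update-cadence points above can be reduced to a simple drift check. The sketch below uses hypothetical thresholds and field names; production engines (Redshift's automatic ANALYZE, for example) apply their own internal heuristics.

```python
# Hypothetical heuristic for deciding when a table's statistics are stale
# enough to justify a refresh. Thresholds and field names are illustrative.

from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    rows_at_last_analyze: int   # row count recorded when stats were gathered
    current_rows: int           # live row count from table metadata
    hot: bool                   # frequently queried / high insert rate

def needs_refresh(stats: TableStats,
                  hot_drift: float = 0.10,          # refresh hot tables at 10% growth
                  cold_drift: float = 0.50) -> bool:  # tolerate 50% on cold ones
    """Refresh when row-count drift since the last ANALYZE exceeds the
    tolerance for the table's temperature; hot tables get a tighter bound
    so the optimizer never plans against badly stale counts for long."""
    if stats.rows_at_last_analyze == 0:
        return True
    drift = abs(stats.current_rows - stats.rows_at_last_analyze) / stats.rows_at_last_analyze
    return drift > (hot_drift if stats.hot else cold_drift)

tables = [
    TableStats("orders",        rows_at_last_analyze=10_000_000, current_rows=13_000_000, hot=True),
    TableStats("orders_2019",   rows_at_last_analyze=5_000_000,  current_rows=5_000_100,  hot=False),
    TableStats("events_stream", rows_at_last_analyze=1_000_000,  current_rows=11_000_000, hot=True),
]
for t in tables:
    print(f"{t.name}: refresh={needs_refresh(t)}")
```

The third table models the 10x-growth case from the takeaways: without a refresh, every plan for it is costed against a row count that is an order of magnitude too small.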
📌 Examples
Skew example: an Orders table with a status column where 90% of rows are active and 10% are everything else. The histogram assumes a uniform distribution, so WHERE status = 'active' is estimated at 10% selectivity when the true figure is 90%, causing the plan to choose an index scan (slow) over a full scan (fast).
Amazon Redshift incremental statistics: hot partitions are updated every few hours, cold partitions weekly. Tables with high insert rates get a daily stats refresh to avoid 10x row-count drift.
Google BigQuery partition pruning: a query with WHERE date BETWEEN '2024-01-01' AND '2024-01-07' on a daily partitioned table scans 7 partitions instead of 365, reducing bytes scanned from 500 gigabytes (GB) to 10 GB and runtime from 30 seconds to under 2 seconds (see the pruning sketch below).
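The pruning arithmetic in the last example can be checked with a short back-of-the-envelope sketch; the partition sizes and date range are assumed to match the figures above rather than taken from BigQuery itself.

```python
# Back-of-the-envelope partition pruning: keep only the daily partitions
# that overlap the WHERE date range and compare bytes scanned.
# Partition count and sizes are assumed for illustration (~500 GB total).

from datetime import date, timedelta

all_partitions = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]
bytes_per_partition = 500 * 10**9 // len(all_partitions)   # ~1.4 GB per day

# Predicate: WHERE date BETWEEN '2024-01-01' AND '2024-01-07'
lo, hi = date(2024, 1, 1), date(2024, 1, 7)
surviving = [p for p in all_partitions if lo <= p <= hi]

scanned_before = len(all_partitions) * bytes_per_partition
scanned_after = len(surviving) * bytes_per_partition
print(f"partitions scanned: {len(all_partitions)} -> {len(surviving)}")
print(f"bytes scanned: {scanned_before / 1e9:.0f} GB -> {scanned_after / 1e9:.1f} GB "
      f"({100 * (1 - scanned_after / scanned_before):.0f}% reduction)")
```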