Full Scan vs Sampling Trade-offs
The Core Decision:
Every profiling system faces a fundamental choice: scan all data for exact metrics, or sample for approximate results at lower cost. The right answer depends on your correctness requirements, compute budget, and latency constraints.
Full Scan: exact metrics, linear cost with data size, can take hours for 100 TB.
Sampling (1%): approximate metrics, 100x cost reduction, completes in minutes.
When Full Scans Are Required:
First, primary key validation in Online Transaction Processing (OLTP) systems demands exact uniqueness guarantees. You cannot approximate whether a customer_id is unique; duplicates cause immediate correctness failures (a minimal check is sketched after these three cases).
Second, financial reconciliation and compliance audits need exact counts. Reporting that you processed approximately 1.5 million transactions (plus or minus 15,000) is not acceptable when the real number is 1,487,342.
Third, detecting rare but critical anomalies requires full scans. If a regional bug affects 0.0001 percent of rows (1,000 orders out of 1 billion), a 1 percent sample sees only 10 million rows, and the roughly 10 bad orders that land in it are statistically invisible in profile-level metrics.
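As a rough illustration of why the uniqueness case needs a full scan, here is a minimal pandas sketch; the orders table, its columns, and the sample fraction are invented for the example, not taken from the lesson:

import pandas as pd

# Toy orders table; in production this would be a warehouse table or a Spark DataFrame.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 3, 4, 5],
    "amount": [10.0, 25.5, 8.0, 8.0, 40.0, 12.5],
})

# Full scan: exact. Every row is examined, so a duplicate key cannot be missed.
exact_dupes = int(orders["customer_id"].duplicated().sum())
print(f"full scan: {exact_dupes} duplicate customer_id value(s)")

# Sample: cheap, but it can only ever fail to certify uniqueness. Both copies of a
# duplicated key must land in the sample for the duplicate to be visible at all.
sample = orders.sample(frac=0.5, random_state=42)  # 50% only because the toy table is tiny
sampled_dupes = int(sample["customer_id"].duplicated().sum())
print(f"sampled:   {sampled_dupes} duplicate customer_id value(s), which proves nothing")

An all-clear from the sampled check is not evidence of uniqueness, which is exactly why it cannot replace the full scan here.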
When Sampling Is Sufficient:
For distribution analysis and query optimization, approximate metrics work well. A query optimizer choosing between join orders does not need to know there are exactly 847,293 distinct values versus an estimate of 850,000 (0.3 percent error). The query plan remains the same.
For anomaly detection on aggregate trends, sampling captures shifts. If daily order volume drops 30 percent, you will see that in a 1 percent sample with high confidence. You are monitoring macro patterns, not hunting individual bad rows.
For schema validation and format checks, sampling provides fast feedback. Discovering that a timestamp column contains string values does not require scanning all 10 billion rows; seeing it in 10,000 sampled rows is enough to alert.
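A sampled format check is easy to sketch. Assuming a pandas DataFrame and a created_at column (both invented for the example), a small sample is enough to raise an alert:

import pandas as pd

def sampled_timestamp_check(df: pd.DataFrame, column: str, frac: float = 0.01) -> float:
    """Return the share of sampled values that fail to parse as timestamps."""
    values = df[column].sample(frac=frac, random_state=0) if len(df) > 10_000 else df[column]
    parsed = pd.to_datetime(values, errors="coerce")  # unparseable values become NaT
    return float(parsed.isna().mean())

# Toy events table with a string polluting the timestamp column.
events = pd.DataFrame({"created_at": ["2024-01-01 10:00:00", "2024-01-02 11:30:00", "not-a-date"]})
failure_rate = sampled_timestamp_check(events, "created_at")
if failure_rate > 0:
    print(f"alert: {failure_rate:.1%} of sampled created_at values are not valid timestamps")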
Cost comparison for a 100 TB dataset: a full scan costs 5,000 CPU hours, versus 50 CPU hours for a 1 percent sample.
Hybrid Strategies for Production:
Most companies use a layered approach. Run fast sampled checks on every batch for broad quality metrics and alerts. Schedule targeted full scans weekly or monthly on critical columns like primary keys, financial amounts, or personally identifiable information where exactness matters.
Another pattern is stratified sampling: sample heavily from recent partitions where issues are most likely, and lightly from older, stable data. For a warehouse with 365 days of retention, profile the last 7 days at a 10 percent sample, the prior 30 days at 1 percent, and the rest at 0.1 percent (a sketch of this schedule follows below).
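A minimal sketch of that schedule, assuming daily partitions; the function name and anything beyond the stated cutoffs are illustrative:

from datetime import date, timedelta

def profiling_sample_fraction(partition_date: date, today: date) -> float:
    """Stratified sampling: newer partitions get profiled more heavily than old, stable ones."""
    age_days = (today - partition_date).days
    if age_days <= 7:        # last 7 days: issues most likely
        return 0.10
    if age_days <= 37:       # the prior 30 days
        return 0.01
    return 0.001             # remainder of the 365-day retention window

today = date(2024, 6, 1)
for age in (0, 3, 10, 120):
    partition = today - timedelta(days=age)
    print(partition, f"-> sample at {profiling_sample_fraction(partition, today):.1%}")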
"The decision is not 'sample or scan.' It is: what is the blast radius if I miss this issue, and how much am I willing to pay to prevent it?"
"The decision is not 'sample or scan.' It is: what is the blast radius if I miss this issue, and how much am I willing to pay to prevent it?"
💡 Key Takeaways
✓ Full scans provide exact metrics but cost scales linearly with data: profiling 100 TB can take 5,000 Central Processing Unit (CPU) hours versus 50 hours for 1 percent sampling
✓ Use full scans when exactness is critical: primary key uniqueness in OLTP, financial reconciliation, detecting rare anomalies affecting under 0.001 percent of rows
✓ Use sampling when approximate metrics suffice: query optimization (0.3 percent cardinality error does not change plans), aggregate anomaly detection (30 percent drop visible in 1 percent sample)
✓ Hybrid strategy works best: fast sampled checks on every batch for alerts, targeted full scans weekly on critical columns where exactness matters
📌 Examples
1. A 1% sample of 1 billion rows (10 million sampled) catches only about 10 of the 1,000 bad records from a bug affecting 0.0001% of rows, too few to stand out in profile-level statistics, so reliable detection requires a full scan
2. A query optimizer uses a HyperLogLog estimate of 850,000 distinct values (actual 847,293) for join order selection; the 0.3% error does not change the chosen plan