What is Data Profiling?
Definition
Data Profiling is the systematic analysis of datasets to understand their structure, content quality, and relationships by computing concrete statistics and metadata about each table and column.
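Profiling covers three kinds of analysis. First, Structure Analysis validates schema and format. For example, it confirms that a created_at column always contains timestamps within a valid range, not strings or nulls.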
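A structure check on a timestamp column can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `profile_timestamp_column`, the sample values, and the 2020 to 2030 valid range are all assumptions made up for the example.

```python
from datetime import datetime

def profile_timestamp_column(values, earliest, latest):
    """Count nulls, non-timestamp values, and out-of-range timestamps."""
    report = {"total": 0, "null": 0, "wrong_type": 0, "out_of_range": 0}
    for v in values:
        report["total"] += 1
        if v is None:
            report["null"] += 1
        elif not isinstance(v, datetime):
            report["wrong_type"] += 1      # e.g. a string that was never parsed
        elif not (earliest <= v <= latest):
            report["out_of_range"] += 1
    return report

# Hypothetical sample of created_at values
created_at = [
    datetime(2024, 1, 5, 12, 0),   # valid
    "2024-01-06",                  # string, not a timestamp
    None,                          # null
    datetime(1970, 1, 1),          # outside the assumed valid range
]
report = profile_timestamp_column(
    created_at, earliest=datetime(2020, 1, 1), latest=datetime(2030, 1, 1)
)
print(report)
```

Counting each failure mode separately, rather than returning a single pass/fail flag, keeps the result usable as a statistic that can be tracked over time.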
Second, Content Analysis examines actual values: frequency distributions, outliers, missing data, and uniqueness. For instance, a country_code column should contain roughly 250 valid values, with null rates below 0.1 percent.
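The content statistics described above (null rate, distinct count, frequency distribution) can be computed with the standard library alone. The sketch below assumes the column fits in memory as a Python list; `profile_column` and the sample data are hypothetical names and values chosen for illustration.

```python
from collections import Counter

def profile_column(values, top_n=3):
    """Return basic content statistics for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": total,
        "null_pct": 100.0 * (total - len(non_null)) / total if total else 0.0,
        "distinct_count": len(counts),
        # Share of all rows taken by each of the most frequent values
        "top_values": [
            (value, 100.0 * n / total) for value, n in counts.most_common(top_n)
        ],
    }

# Hypothetical country_code sample: 10 rows, one of them null
country_code = ["US"] * 5 + ["UK"] * 3 + ["CA"] + [None]
profile = profile_column(country_code)
print(profile)
```

On a real table the same numbers would usually come from SQL aggregates or a dataframe library; the point here is only which statistics are collected.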
Third, Relationship Analysis looks across columns and tables: checking primary and foreign key integrity, functional dependencies, and cross-table consistency. For example, every user_id in an orders table should exist in the users table.
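A foreign key integrity check reduces to finding child keys with no matching parent. The following is a minimal in-memory sketch; `orphan_rate` and the sample key lists are assumptions for illustration, and a production check would more likely run as an anti-join in the warehouse.

```python
def orphan_rate(child_keys, parent_keys):
    """Return (percent of non-null child keys with no parent, sorted orphan keys)."""
    parent_set = set(parent_keys)
    checked = [k for k in child_keys if k is not None]
    orphans = [k for k in checked if k not in parent_set]
    pct = 100.0 * len(orphans) / len(checked) if checked else 0.0
    return pct, sorted(set(orphans))

# Hypothetical data: users.user_id and orders.user_id
users = [1, 2, 3, 4]
orders_user_ids = [1, 1, 2, 3, 9, 4, 2, 7]

pct, missing = orphan_rate(orders_user_ids, users)
print(f"{pct:.1f}% of orders reference a missing user: {missing}")
```

Null child keys are excluded from the denominator here on the assumption that missing values are reported under completeness rather than integrity.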
Why Statistics Matter:
Profiling becomes actionable when you attach concrete numbers to quality dimensions: completeness (percent non-null), validity (percent matching patterns), consistency (cross-field agreement), uniqueness (distinct count ratios), and timeliness (data freshness). Tracking these metrics as time series enables Service Level Objectives (SLOs) for data quality, just like you have SLOs for application latency or error rates.
💡 Key Takeaways
✓ Data profiling produces metadata and statistics about datasets: distinct counts, null percentages, min and max values, distributions, and relationship integrity checks
✓ Three analysis types: structure (schema and format validation), content (value distributions and quality), and relationships (keys and cross-table consistency)
✓ Profiling attaches quantitative metrics to quality dimensions: completeness, validity, consistency, uniqueness, and timeliness tracked over time
✓ Results enable data quality SLOs and prevent building analytics or ML models on unreliable data
📌 Examples
1. A <code>country_code</code> column profiled shows 247 distinct values, 0.08% nulls, and top 3 values are US (42%), UK (18%), CA (12%)
2. Relationship check discovers that 0.3% of <code>order_id</code> values in the shipments table have no matching record in the orders table, indicating a data integrity bug
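To show how profiling metrics can feed a data quality SLO, the sketch below measures completeness and uniqueness for one column and compares them against thresholds. The `ColumnMetrics` class, the function names, the sample `order_ids`, and the thresholds (99.9 percent completeness, fully unique) are all assumptions for illustration, not values prescribed by any tool.

```python
from dataclasses import dataclass

@dataclass
class ColumnMetrics:
    completeness_pct: float   # percent of rows that are non-null
    uniqueness_ratio: float   # distinct non-null values / non-null values

def measure(values):
    """Compute completeness and uniqueness for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    completeness = 100.0 * len(non_null) / total if total else 0.0
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    return ColumnMetrics(completeness, uniqueness)

def slo_breaches(metrics, min_completeness_pct, min_uniqueness_ratio):
    """Return the names of the quality dimensions that fall below target."""
    breaches = []
    if metrics.completeness_pct < min_completeness_pct:
        breaches.append("completeness")
    if metrics.uniqueness_ratio < min_uniqueness_ratio:
        breaches.append("uniqueness")
    return breaches

# Hypothetical batch: one duplicate and one null
order_ids = [101, 102, 103, 103, None]
metrics = measure(order_ids)
breaches = slo_breaches(metrics, min_completeness_pct=99.9, min_uniqueness_ratio=1.0)
print(metrics, breaches)
```

Running `measure` on every batch and storing the result with a timestamp would give the time series that an SLO is evaluated against.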