What is Data Profiling?

Definition
Data Profiling is the systematic analysis of datasets to understand their structure, content quality, and relationships by computing concrete statistics and metadata about each table and column.
The Problem It Solves:
Modern systems generate massive amounts of data, but that data is often incomplete, inconsistent, or misunderstood. Building analytics dashboards, Machine Learning (ML) models, or user-facing features on top of poor-quality data leads to wrong metrics, biased models, and production incidents. Before you can trust your data, you need a quantitative answer to one question: what is the actual shape and quality of this data?

Three Core Analysis Types:

First, Structure Analysis verifies that schemas and formats match expectations. For example, checking that a created_at column always contains timestamps within a valid range, not strings or nulls.
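
A structure check like the created_at example can be sketched in a few lines of plain Python. This is a minimal illustration, not a production profiler: the function name `profile_timestamp_column`, the sample rows, and the valid range are all assumptions made up for the example.

```python
from datetime import datetime, timezone


def profile_timestamp_column(values, earliest, latest):
    """Classify each value as null, wrong type, out of range, or valid."""
    report = {"total": 0, "null": 0, "wrong_type": 0, "out_of_range": 0, "valid": 0}
    for value in values:
        report["total"] += 1
        if value is None:
            report["null"] += 1
        elif not isinstance(value, datetime):
            report["wrong_type"] += 1
        elif not (earliest <= value <= latest):
            report["out_of_range"] += 1
        else:
            report["valid"] += 1
    return report


# Hypothetical sample rows for a created_at column.
created_at = [
    datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc),
    "2024-03-02",                                # a string, not a timestamp
    None,                                        # a missing value
    datetime(1970, 1, 1, tzinfo=timezone.utc),   # suspicious epoch default
]

# The valid range is an assumption; pick bounds that fit your own data.
structure_report = profile_timestamp_column(
    created_at,
    earliest=datetime(2015, 1, 1, tzinfo=timezone.utc),
    latest=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
print(structure_report)
```

Counting each failure mode separately, rather than returning a single pass/fail flag, keeps the result usable as a statistic that can be compared across runs.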

Second, Content Analysis examines actual values: frequency distributions, outliers, missing data, and uniqueness. For instance, a country_code column should contain roughly 250 valid values, with null rates below 0.1 percent.
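
The content statistics described above (null rate, distinct count, most frequent values) might be computed along these lines. The helper `profile_column` and the tiny sample list are hypothetical; a real column would have far more rows and would usually be profiled inside the warehouse or with a dataframe library.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Return the null percentage, distinct count, and top values of a column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "total": total,
        "null_pct": 100.0 * (total - len(non_null)) / total if total else 0.0,
        "distinct": len(counts),
        # Share of all rows (nulls included) held by each of the top values.
        "top_values": [(v, 100.0 * c / total) for v, c in counts.most_common(top_n)],
    }


# Hypothetical sample of a country_code column.
country_code = ["US"] * 5 + ["UK"] * 3 + ["CA"] + [None]

content_report = profile_column(country_code)
print(content_report)
```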

Third, Relationship Analysis looks across columns and tables: checking primary and foreign key integrity, functional dependencies, and cross-table consistency. For example, every user_id in an orders table should exist in the users table.
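
A foreign key integrity check of this kind reduces to a set lookup. The sketch below assumes both key columns fit in memory as Python lists; `find_orphans` and the sample IDs are invented for illustration. In a database the same check is typically an anti-join.

```python
def find_orphans(child_keys, parent_keys):
    """Return child keys with no matching parent, plus the orphan percentage."""
    parents = set(parent_keys)
    # Null keys are skipped here; they belong to the completeness check.
    checked = [k for k in child_keys if k is not None]
    orphans = [k for k in checked if k not in parents]
    orphan_pct = 100.0 * len(orphans) / len(checked) if checked else 0.0
    return orphans, orphan_pct


# Hypothetical key columns: users.user_id and orders.user_id.
users_user_ids = [1, 2, 3, 4]
orders_user_ids = [1, 1, 2, 4, 7, None, 9, 3, 2]

orphans, orphan_pct = find_orphans(orders_user_ids, users_user_ids)
print(orphans, orphan_pct)
```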

Why Statistics Matter:
Profiling becomes actionable when you attach concrete numbers to quality dimensions: completeness (percent non-null), validity (percent matching expected patterns), consistency (cross-field agreement), uniqueness (ratio of distinct values to total values), and timeliness (data freshness). Tracking these metrics as time series enables Service Level Objectives (SLOs) for data quality, just like the SLOs you have for application latency or error rates.
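
As a rough sketch of how such metrics can feed an SLO check, the snippet below computes completeness, validity, and uniqueness for one column and compares them with minimum targets. The function names, the two-letter pattern, and the target values are assumptions for the example. In practice each run's metrics would be stored with a timestamp so they form a time series.

```python
import re


def quality_metrics(values, pattern):
    """Compute completeness, validity, and uniqueness for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    valid = [v for v in non_null if re.fullmatch(pattern, v)]
    return {
        "completeness_pct": 100.0 * len(non_null) / total if total else 0.0,
        "validity_pct": 100.0 * len(valid) / len(non_null) if non_null else 0.0,
        "uniqueness_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
    }


def slo_breaches(metrics, targets):
    """Return the names of metrics that fall below their minimum target."""
    return sorted(name for name, minimum in targets.items() if metrics[name] < minimum)


# Hypothetical country_code sample: one null and one malformed value.
country_code = ["US", "UK", "usa", None, "US"]
metrics = quality_metrics(country_code, r"[A-Z]{2}")

# Hypothetical SLO targets, expressed as minimum acceptable values.
targets = {"completeness_pct": 99.9, "validity_pct": 95.0}
breaches = slo_breaches(metrics, targets)
print(metrics, breaches)
```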

💡 Key Takeaways

✓ Data profiling produces metadata and statistics about datasets: distinct counts, null percentages, min and max values, distributions, and relationship integrity checks

✓ Three analysis types: structure (schema and format validation), content (value distributions and quality), and relationships (keys and cross-table consistency)

✓ Profiling attaches quantitative metrics to quality dimensions: completeness, validity, consistency, uniqueness, and timeliness, all tracked over time

✓ Results enable data quality SLOs and prevent building analytics or ML models on unreliable data

📌 Interview Tips

1. Profiling a <code>country_code</code> column shows 247 distinct values, 0.08% nulls, and a top three of US (42%), UK (18%), and CA (12%)

2. A relationship check discovers that 0.3% of <code>order_id</code> values in the shipments table have no matching record in the orders table, indicating a data integrity bug
