What is Data Quality Monitoring in ML Systems?
WHY DATA QUALITY MATTERS
ML models are only as good as their data. A model trained on clean data will fail silently when fed garbage. Unlike traditional software that crashes on bad input, ML models produce predictions—they just produce wrong predictions.
Data quality issues are the leading cause of ML system failures in production. Studies show 60-80% of ML pipeline failures trace back to data issues: missing features, schema changes, value range violations, or upstream pipeline failures.
QUALITY DIMENSIONS
Completeness: Are all expected values present? Null rates, missing features, partial records.
Consistency: Do values follow expected formats and constraints? Categorical values should come from the expected set; numerical values should fall within valid ranges.
Timeliness: Is data fresh enough? Stale data (features computed on old data) can be as harmful as missing data.
Accuracy: Do values represent reality? Hardest to measure—requires ground truth comparison or domain knowledge.
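The first three dimensions can be checked mechanically without ground truth. A minimal sketch in plain Python, where the schema, field names, and thresholds are illustrative assumptions, not a standard:

```python
from datetime import datetime, timezone

# Illustrative expectations for one feature record (assumed schema).
EXPECTED_FIELDS = {"age", "country", "event_time"}
VALID_COUNTRIES = {"US", "DE", "JP"}
AGE_RANGE = (0, 120)
MAX_STALENESS_SECONDS = 3600  # timeliness budget (assumed)

def check_record(record: dict, now: datetime) -> list[str]:
    """Return a list of quality violations for one record."""
    issues = []

    # Completeness: every expected field present and non-null.
    for field in sorted(EXPECTED_FIELDS):
        if record.get(field) is None:
            issues.append(f"missing: {field}")

    # Consistency: categorical membership and numeric range.
    if record.get("country") not in VALID_COUNTRIES | {None}:
        issues.append(f"invalid country: {record['country']}")
    age = record.get("age")
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        issues.append(f"age out of range: {age}")

    # Timeliness: record no older than the staleness budget.
    ts = record.get("event_time")
    if ts is not None and (now - ts).total_seconds() > MAX_STALENESS_SECONDS:
        issues.append("stale record")

    return issues

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
bad = {
    "age": 250,
    "country": "XX",
    "event_time": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
}
print(check_record(bad, now))
```

In production these per-record checks are typically aggregated into per-batch metrics (null rate, violation rate) and alerted on, rather than evaluated one record at a time.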
DATA QUALITY VS DATA DRIFT
Data quality and data drift are related but distinct. Data quality issues are bugs—the data is wrong. Data drift is the world changing—the data is correct but different from training. A feature becoming all nulls is a quality issue. A feature distribution shifting because user behavior changed is drift.
The two demand different responses: quality issues need bug fixes in the pipeline; drift needs model adaptation, such as retraining. Distinguishing between them is crucial for taking the right action.
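One crude way to triage a numeric feature: a null-rate spike against training data suggests a pipeline bug, while a clean-but-shifted distribution suggests drift. A sketch under stated assumptions; the thresholds and the simple z-test on the mean are illustrative stand-ins for real drift detectors such as KS or PSI tests:

```python
import random
import statistics

def diagnose_feature(train_values, live_values,
                     null_threshold=0.05, shift_threshold=3.0):
    """Crude triage: quality bug vs drift (thresholds are assumptions)."""
    # Quality check: live null rate above the allowed budget.
    live_null_rate = sum(v is None for v in live_values) / len(live_values)
    if live_null_rate > null_threshold:
        return "quality issue: null rate %.0f%%" % (100 * live_null_rate)

    # Drift check: live mean more than `shift_threshold` standard
    # errors from the training mean (a crude z-test on the mean).
    clean = [v for v in live_values if v is not None]
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    z = abs(statistics.mean(clean) - mu) / (sigma / len(clean) ** 0.5)
    if z > shift_threshold:
        return "drift: mean shifted (z=%.1f)" % z
    return "ok"

random.seed(0)
train = [random.gauss(50, 10) for _ in range(1000)]
broken = [None] * 200 + [random.gauss(50, 10) for _ in range(800)]  # upstream bug
drifted = [random.gauss(60, 10) for _ in range(1000)]               # behavior changed

print(diagnose_feature(train, broken))   # flags a quality issue
print(diagnose_feature(train, drifted))  # flags drift
```

Note the check order matters: a feature can exhibit both problems at once, and the quality check should run first, since computing drift statistics over buggy data is meaningless.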