Failure Modes and Edge Cases in Production Reproducibility
Training-Serving Skew
Training-serving skew occurs when a model is trained on batch features but served with real-time features, causing up to a 20 percent accuracy drop. A ranking model trained on yesterday's aggregated user statistics but served with current-session features sees exactly this distribution shift. Data time-travel gaps are related: source systems overwrite or delete data, so reproduced runs silently read different inputs. An append-only raw table gets compacted, or a 90-day retention policy deletes the training window you need to replay. Fix with append-only storage, snapshotting, and dataset fingerprints (content hashes) that fail loudly when data is missing.
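A minimal sketch of the fail-loudly fingerprint idea: hash every input file's contents into one digest and raise immediately if any file has been deleted or moved. The function name and chunk size are illustrative, not from any particular tool.

```python
import hashlib
import os

def dataset_fingerprint(paths):
    """Content-hash a set of input files; fail loudly if any is missing."""
    digest = hashlib.sha256()
    for path in sorted(paths):  # sort so the fingerprint is order-independent
        if not os.path.exists(path):
            # Raise instead of silently reading a different (or empty) input.
            raise FileNotFoundError(f"dataset input missing: {path}")
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Recording this hexdigest alongside each training run lets a replay verify it is reading byte-identical inputs before any compute is spent.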
Feature Skew
Feature skew between offline and online systems is subtler. Training might compute a 30-day click-through rate from exact timestamps, while serving uses a cached daily aggregate that updates at midnight: the model learns on precise signals but predicts with stale approximations. Airbnb's Zipline addresses this by version-controlling feature definitions and materializations, ensuring training and serving read from the same logical feature pipeline. Partial logging or missing lineage arises when teams forget to log configs or datasets, and models enter the registry without provenance. Fix with pipeline gates that block artifact registration unless mandatory fields are present.
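The registration gate above can be sketched in a few lines: refuse to write a model into the registry unless every mandatory provenance field is present. The field names and the dict-backed registry are illustrative assumptions, not a real registry API.

```python
# Mandatory provenance fields; the names are illustrative.
REQUIRED_FIELDS = {"dataset_fingerprint", "git_commit", "config", "training_window"}

def register_model(registry, name, metadata):
    """Block artifact registration unless all mandatory fields are present."""
    present = {k for k, v in metadata.items() if v is not None}
    missing = REQUIRED_FIELDS - present
    if missing:
        raise ValueError(f"refusing to register {name}: missing {sorted(missing)}")
    registry[name] = metadata
```

Because the gate runs at registration time rather than audit time, a model without provenance never exists in the registry at all, instead of being discovered months later.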
PII Leakage and Security
PII leakage in metadata occurs when logged configurations or parameters include secrets or personal data; an experiment config might accidentally capture a database connection string or raw customer IDs. Fix with client-side redaction, allowlisted fields, and automated detectors that scan metadata for patterns such as credit card numbers. Encrypt sensitive fields at rest.
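A minimal sketch of client-side redaction combining both defenses: non-allowlisted fields are dropped outright, and allowlisted values are still scanned against detector patterns before logging. The allowlist entries and the two regexes are illustrative; production detectors need far broader pattern sets.

```python
import re

# Illustrative detectors; real deployments need much broader coverage.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "connection_string": re.compile(r"\w+://[^\s:]+:[^\s@]+@\S+"),
}
ALLOWLIST = {"learning_rate", "batch_size", "model_name"}  # fields safe to log

def redact_config(config):
    """Drop non-allowlisted fields and mask values matching PII patterns."""
    clean = {}
    for key, value in config.items():
        if key not in ALLOWLIST:
            clean[key] = "<redacted:field>"
            continue
        text = str(value)
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                text = f"<redacted:{label}>"
                break
        clean[key] = text
    return clean
```

Running this on the client, before metadata leaves the training process, means a secret never reaches the metadata store in the first place, so at-rest encryption becomes defense in depth rather than the only barrier.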
Statistical Overfitting and Infrastructure Issues
Statistical overfitting to the test set inflates the Type I error rate when many runs are iterated against the same validation slice: after evaluating 100 hyperparameter combinations, the best one has likely overfit to noise. Fix with pre-registered evaluation protocols, multiple repeats for top candidates, and holdback test sets accessed only for final contenders. Metadata-store hot spots appear during hyperparameter-optimization bursts. Fix with write-optimized append-only event logs, eventually consistent materialized views, and time-based partitioning.
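The write path of such a store can be sketched as a date-partitioned, append-only JSON-lines log: each event is a cheap append, and materialized views are rebuilt later by replaying partitions. The layout (`dt=YYYY-MM-DD` directories, `events.jsonl` files) is an assumed convention, not a specific product's format.

```python
import json
import os
from datetime import datetime, timezone

def append_event(root, event):
    """Append a metadata event to a date-partitioned JSON-lines log.

    Appends never rewrite existing data, so bursty hyperparameter sweeps
    hit a cheap sequential write path; views are materialized later by
    replaying the relevant date partitions.
    """
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    partition = os.path.join(root, f"dt={day}")
    os.makedirs(partition, exist_ok=True)
    with open(os.path.join(partition, "events.jsonl"), "a") as f:
        f.write(json.dumps(event) + "\n")
```

Time-based partitioning also makes retention trivial: expiring old metadata is a directory delete, not a table-wide compaction that would contend with live writes.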