Logging and Measurement: Building Training Data from Production
Production Feature Logging
The second pillar of skew prevention is explicit measurement through production feature logging. The gold standard is logging the exact feature vector used for each prediction at serving time, then building your next training dataset directly from these logs. This eliminates entire classes of skew because training literally uses what production saw, including all the quirks: missing values from timeouts, stale cache entries, upstream service failures, and edge cases that never appear in clean offline pipelines.
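The idea above can be sketched as a minimal serving-time logger. This is an illustrative sketch, not a production system: the record schema, the `log_prediction` helper, and the `fraud-v7` model name are all hypothetical, and the JSON-lines sink stands in for whatever log transport you actually use.

```python
import io
import json
import time
import uuid

def log_prediction(sink, features, output, model_version):
    """Write one serving-time record as a JSON line.

    `features` is logged exactly as the model consumed it, including
    quirks such as values that came back None after a timeout.
    """
    record = {
        "prediction_id": str(uuid.uuid4()),  # join key for labels that arrive later
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "output": output,
    }
    sink.write(json.dumps(record) + "\n")
    return record

# Example: one prediction whose feature vector carries a production quirk
# (a timed-out upstream feature logged as None, not silently imputed).
sink = io.StringIO()
rec = log_prediction(
    sink,
    features={"user_age": 34, "device_score": None, "txn_amount": 129.99},
    output={"fraud_prob": 0.031},
    model_version="fraud-v7",
)
```

Keeping the quirk (the `None`) in the log, rather than cleaning it away, is the point: the next training set inherits exactly what production saw.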
Scale Considerations
At 20,000 QPS with 1 kilobyte per feature vector, you generate 20 megabytes per second of logs, which is 72 gigabytes per hour or roughly 1.7 terabytes per day. Real systems control this through sampling (log 1 to 10 percent of predictions), compression (Protocol Buffers or Avro), and schema minimization. Google's TFX-style stacks log features, model outputs, and the eventually observed labels, creating a closed loop where serving directly feeds training.
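One common way to implement the sampling step is to hash a stable identifier rather than draw a random number, so the keep/drop decision is reproducible across retries and replicas. A minimal sketch (the `should_log` helper and its MD5-based bucketing are illustrative choices, not a prescribed design):

```python
import hashlib

def should_log(prediction_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministic sampling: hash the ID into one of 10,000 buckets
    and keep the prediction if its bucket falls below the sample rate."""
    h = int(hashlib.md5(prediction_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < sample_rate

# Rough volume arithmetic from the text: 20,000 QPS * 1 KB per vector.
bytes_per_second = 20_000 * 1_000          # 20 MB/s
bytes_per_day = bytes_per_second * 86_400  # ~1.7 TB/day before sampling
```

At a 5 percent sample rate, the daily volume drops from ~1.7 TB to under 100 GB while still capturing tens of millions of examples.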
Blocking Skew Tests
This enables powerful validation workflows. Before deploying a new model, you run blocking skew tests: take 1,000 to 10,000 examples from your offline pipeline with expected outputs, load your packaged model with bundled transforms, run both paths, and assert output deltas below tight thresholds (maximum absolute difference less than 0.000001 for deterministic models). If there is a mismatch, you binary search the stack to localize the divergence.
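The blocking test reduces to a small harness that runs both paths on the same examples and compares outputs. A sketch under stated assumptions: `offline_fn` and `serving_fn` are hypothetical stand-ins for your offline pipeline and your packaged model with bundled transforms, and the lambdas below are toy models used only to exercise the harness.

```python
def blocking_skew_test(offline_fn, serving_fn, examples, tol=1e-6):
    """Run both paths on identical inputs and report the worst-case delta.

    Returns (passed, max_delta); a deploy should block when passed is False.
    """
    max_delta = 0.0
    for x in examples:
        delta = abs(offline_fn(x) - serving_fn(x))
        max_delta = max(max_delta, delta)
    return max_delta <= tol, max_delta

# Toy stand-ins: serving bundles the exact transform the offline path used.
offline = lambda x: 2.0 * x + 1.0
serving = lambda x: 2.0 * x + 1.0
ok, delta = blocking_skew_test(offline, serving, [0.1 * i for i in range(1000)])
```

When the test fails, `max_delta` tells you how far apart the paths are, which helps decide whether you are chasing a real transform bug or benign floating-point noise.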
Continuous Monitoring
Continuous monitoring extends this to live traffic. You compute distribution metrics between your training reference dataset and live serving slices: Population Stability Index (PSI), warning above 0.1 and alerting above 0.2; Kolmogorov-Smirnov (KS) statistics for continuous features; and Jensen-Shannon (JS) divergence for categorical distributions. When a fraud model sees PSI spike on device-fingerprint features, that triggers investigation before false positive rates climb.
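PSI itself is simple to compute once both datasets are binned over the same edges. A minimal sketch, assuming bin counts are precomputed elsewhere; the `psi` and `drift_status` helpers and the epsilon guard for empty bins are illustrative choices:

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between a training reference ("expected")
    and a live serving slice ("actual"), both binned over the same edges.

    PSI = sum over bins of (a_pct - e_pct) * ln(a_pct / e_pct).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # eps guards against empty bins
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

def drift_status(psi_value):
    """Map a PSI value onto the warn/alert thresholds from the text."""
    if psi_value > 0.2:
        return "alert"
    if psi_value > 0.1:
        return "warn"
    return "ok"
```

An identical distribution yields PSI of zero, while a live slice that has visibly shifted mass across bins lands above the 0.2 alert threshold.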