What is Feature Backfilling in ML Systems?
Why Backfilling is Essential
Machine learning models require substantial historical data for training, but a team shipping a new user-engagement feature cannot wait six months for it to accumulate enough examples in production. Backfilling enables immediate model iteration by computing what the feature's value would have been at each historical timestamp, using only data that was available at that time.
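As a minimal sketch of this idea (the event table, column names, and 7-day click window are all hypothetical, not from any particular feature store), a backfill loop recomputes the feature once per historical date, each time filtering to events strictly before that date:

```python
import pandas as pd

# Hypothetical raw event log: one row per user click.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 2],
    "event_date": pd.to_datetime(
        ["2024-01-10", "2024-01-12", "2024-01-12", "2024-01-14", "2024-01-15"]
    ),
})

def backfill_clicks_7d(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Compute 'clicks in the last 7 days' as it would have looked on
    `as_of`, using only events strictly before that date."""
    window = events[
        (events["event_date"] < as_of)
        & (events["event_date"] >= as_of - pd.Timedelta(days=7))
    ]
    feat = window.groupby("user_id").size().rename("clicks_7d").reset_index()
    feat["as_of"] = as_of
    return feat

# Backfill the feature over a range of historical dates.
rows = [backfill_clicks_7d(events, d)
        for d in pd.date_range("2024-01-13", "2024-01-16")]
backfilled = pd.concat(rows, ignore_index=True)
```

In production the same per-date computation would run as a partitioned batch job rather than a Python loop, but the invariant is identical: each output row depends only on data dated before its `as_of` timestamp.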
Point-in-Time Correctness Requirement
Backfills must be point-in-time correct: a feature computed for January 15th must use only data that existed on January 14th or earlier. Using current state (what we know now) instead of historical state (what we knew then) causes future-data leakage, inflating offline metrics that production performance then fails to match, typically by 5 to 20 percent.
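One common way to enforce this constraint when joining labels to feature snapshots is an as-of join; the sketch below uses pandas `merge_asof` with `allow_exact_matches=False` so each label row only sees strictly earlier feature rows (the tables and column names are illustrative):

```python
import pandas as pd

# Hypothetical training labels and timestamped feature snapshots.
labels = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "label_ts": pd.to_datetime(["2024-01-15", "2024-01-20", "2024-01-15"]),
    "label": [0, 1, 1],
})
features = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-10", "2024-01-18", "2024-01-16"]),
    "score": [0.2, 0.9, 0.5],
})

# For each label row, pick the latest feature row whose timestamp is
# strictly earlier; allow_exact_matches=False excludes same-instant rows.
train = pd.merge_asof(
    labels.sort_values("label_ts"),
    features.sort_values("feature_ts"),
    left_on="label_ts",
    right_on="feature_ts",
    by="entity_id",
    allow_exact_matches=False,
)
```

Entity 2's only feature snapshot (January 16th) postdates its January 15th label, so the join leaves its feature null rather than leaking future state into the training row.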
Scale Considerations
Production backfills process terabytes to petabytes of historical data. A typical baseline is 5 to 20 TB per hour of throughput on a 100-worker Spark cluster using columnar formats with partition pruning. Backfilling one year of daily features for 100 million entities might take 10 to 50 hours of compute time.
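A back-of-envelope calculation shows how those figures relate (the per-row size and the raw-input-to-output ratio are assumptions chosen for illustration, not measurements):

```python
# Back-of-envelope backfill sizing; all constants are illustrative.
entities = 100_000_000        # 100 million entities
days = 365                    # one year of daily feature values
bytes_per_row = 200           # assumed serialized feature-row size

total_rows = entities * days                 # 36.5 billion output rows
total_tb = total_rows * bytes_per_row / 1e12 # ~7.3 TB of feature output

# Assume the job must scan ~50x more raw input than it emits, on a
# cluster sustaining 10 TB/hour, mid-range of the 5-20 TB/hour baseline.
scan_tb = total_tb * 50
hours = scan_tb / 10
```

Under these assumptions the job scans roughly 365 TB of raw data and runs for about 36 hours, landing inside the 10-to-50-hour range quoted above; halving the scan ratio or doubling throughput moves it proportionally.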
The Reproducibility Foundation
Backfills enable reproducibility: given the same raw data and feature definitions, you can recreate any historical training dataset. This supports model audits, debugging production issues, and understanding why a model made specific predictions months ago.