Feature Engineering & Feature Stores › Backfilling & Historical Features · Medium · ⏱️ ~3 min

What is Feature Backfilling in ML Systems?

Definition
Feature backfilling is the controlled recomputation of feature values over historical time ranges to create point-in-time-correct training datasets. When you launch a new feature or fix a bug in feature logic, you retroactively compute that feature over past data so your training set has months or years of examples.

Why Backfilling is Essential

Machine learning models require substantial historical data for training. A new user engagement feature cannot wait 6 months to accumulate enough examples. Backfilling enables immediate model iteration by computing what the feature would have been at each historical timestamp, using only data available at that time.
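The backfill itself is conceptually simple: walk every historical day and recompute the feature for each entity as of that day. A minimal sketch (the `compute_fn` callback and its contract are illustrative assumptions, not a specific framework's API):

```python
from datetime import date, timedelta

def backfill_feature(compute_fn, entities, start: date, end: date):
    """Recompute a feature for every entity on every day in [start, end].

    compute_fn(entity, as_of) must use only data available strictly
    before `as_of` -- that is what makes the backfill point-in-time correct.
    """
    rows = []
    day = start
    while day <= end:
        for entity in entities:
            rows.append({
                "entity_id": entity,
                "as_of": day,
                "value": compute_fn(entity, day),
            })
        day += timedelta(days=1)
    return rows
```

In production this loop is distributed (e.g. one partition per day), but the per-row contract is the same: the computation only sees data older than its own timestamp.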

Point-in-Time Correctness Requirement

Backfills must be point-in-time correct: a feature computed for January 15th must use only data that existed on January 14th or earlier. Using current state (what we know now) instead of historical state (what we knew then) causes label leakage, inflating offline metrics while degrading production performance by 5 to 20 percent.
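The rule reduces to a strict inequality on timestamps. A toy sketch with a hypothetical event log (the schema and amounts are made up for illustration):

```python
from datetime import date

# Hypothetical event log: (entity_id, event_date, amount).
events = [
    ("u1", date(2024, 1, 10), 5),
    ("u1", date(2024, 1, 14), 3),
    ("u1", date(2024, 1, 15), 7),   # same-day event: must be excluded
    ("u1", date(2024, 1, 20), 2),   # future event: must be excluded
]

def total_amount_as_of(entity_id, as_of):
    """Sum amounts using only events strictly before `as_of`."""
    return sum(a for e, d, a in events if e == entity_id and d < as_of)

# The feature for January 15th sees only data through January 14th:
total_amount_as_of("u1", date(2024, 1, 15))  # 5 + 3 = 8
```

Filtering with `d < as_of` rather than `d <= as_of` is the difference between a leaky backfill and a correct one: a same-day event was not yet known when the prediction would have been made.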

Scale Considerations

Production backfills process terabytes to petabytes of historical data. A typical baseline is 5 to 20 TB per hour of throughput on a 100-worker Spark cluster using columnar formats with partition pruning. Backfilling 1 year of daily features for 100 million entities might take 10 to 50 hours of compute time.
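The back-of-envelope arithmetic is worth internalizing (the 50 TB figure below is an assumed example, not from the text):

```python
def backfill_hours(total_tb: float, tb_per_hour: float) -> float:
    """Rough wall-clock estimate for a backfill at a given throughput."""
    return total_tb / tb_per_hour

# Say the historical source data for one feature is 50 TB; at the
# 5-20 TB/hour baseline above the scan alone takes:
backfill_hours(50, 5)   # 10.0 hours at the low end
backfill_hours(50, 20)  # 2.5 hours at the high end
```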

The Reproducibility Foundation

Backfills enable reproducibility: given the same raw data and feature definitions, you can recreate any historical training dataset. This supports model audits, debugging production issues, and understanding why a model made specific predictions months ago.

💡 Key Takeaways
Point-in-time correctness requires features computed as of time t to use only data available strictly before t, preventing label leakage that would inflate training accuracy but fail in production
Airbnb Zipline achieves single-digit-millisecond p50 latency (2 to 10 ms) for online lookups while backfilling 30 to 365 days of history offline in hours to days
Idempotency through deterministic upserts keyed by entity ID, feature name, timestamp, and a monotonic version ensures reruns converge to identical results without duplicates
Uber Michelangelo reprocesses hundreds of terabytes per day using copy-on-write or merge-on-read patterns with partition-level checkpointing for restartability
The same feature definition executes offline for backfill and online for serving, eliminating training-serving skew that can cause model accuracy drops of 20% or more in production
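The idempotency point above can be sketched with a plain dictionary standing in for the feature store (the store shape and version rule are illustrative assumptions):

```python
def upsert(store: dict, entity_id, feature, ts, version, value):
    """Deterministic upsert keyed by (entity_id, feature, ts).

    A re-run writes the same key, so no duplicates accumulate; a higher
    monotonic version wins, so replays converge to identical state.
    """
    key = (entity_id, feature, ts)
    current = store.get(key)
    if current is None or version >= current[0]:
        store[key] = (version, value)

store = {}
upsert(store, "u1", "trip_count_30d", "2024-01-20", 1, 12)
upsert(store, "u1", "trip_count_30d", "2024-01-20", 1, 12)  # rerun: no duplicate
upsert(store, "u1", "trip_count_30d", "2024-01-20", 2, 13)  # fixed logic wins
```

Because writes are keyed rather than appended, a failed backfill can simply be restarted from the last checkpoint and replay partitions it already wrote.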
📌 Interview Tips
1. Netflix reprocesses hundreds of billions of events per day to build training datasets after logic changes, completing full-day reprocesses within overnight maintenance windows at multi-terabyte-per-hour throughput
2. Meta's Feature Store runs billions of feature computations daily, with single-digit-millisecond median online reads and nightly training dataset generation (10^9 to 10^11 rows) completing within 12 hours
3. A ride-sharing platform backfills a user's 30-day trip count feature: for a training row at January 20th, it counts trips from December 21 to January 19 inclusive, strictly excluding January 20 and later to avoid leakage