Feature Engineering & Feature Stores › Backfilling & Historical Features · Medium · ⏱️ ~3 min

What is Feature Backfilling in ML Systems?

Feature backfilling is the controlled recomputation of feature values over historical time ranges to create point-in-time correct training datasets. When you launch a new feature or fix a bug in feature logic, you cannot wait months for fresh data to accumulate. Instead, you retroactively compute those features over past data so your training set has complete coverage. The critical requirement is point-in-time correctness: a feature computed as of time t must use only information available strictly before t. If you are computing a user's 7-day purchase count as of January 15th, you can only include purchases from January 8th through January 14th. This prevents data leakage, where future information contaminates training labels.

Airbnb's Zipline system reduced feature onboarding from weeks to days by automating this process, enabling teams to backfill 30 to 365 days of history before training. Backfills operate on immutable raw event logs with explicit event-time versus processing-time semantics. The same feature definition runs both offline (for backfill and training) and online (for serving predictions), ensuring no training-serving skew. Uber's Michelangelo platform processes hundreds of terabytes per day this way, maintaining point-in-time semantics across batch backfills and streaming online features.

Production backfills must be idempotent and restartable. Idempotency means rerunning the same backfill converges to identical results, achieved through deterministic upserts keyed by entity id plus timestamp plus a monotonic feature version. Restartability relies on partition checkpointing, so a 10-hour backfill job can resume from hour 7 if it fails, without reprocessing or corrupting earlier results.
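The point-in-time window described above can be sketched as a small function. This is a minimal illustration, not any platform's actual API; the function name and list-of-dates representation are assumptions for the example.

```python
from datetime import date, timedelta

def seven_day_purchase_count(purchases: list[date], as_of: date) -> int:
    """Count purchases in the 7 days strictly before `as_of`.

    Point-in-time correctness: the window is [as_of - 7 days, as_of),
    so nothing from `as_of` itself or later can leak into the feature.
    """
    start = as_of - timedelta(days=7)
    return sum(1 for d in purchases if start <= d < as_of)

# As of January 15th, only purchases from Jan 8 through Jan 14 count:
# Jan 7 falls outside the window, Jan 15 is strictly excluded.
purchases = [date(2024, 1, 7), date(2024, 1, 8),
             date(2024, 1, 14), date(2024, 1, 15)]
count = seven_day_purchase_count(purchases, date(2024, 1, 15))  # → 2
```

The half-open interval `[as_of - 7d, as_of)` is the key design choice: including `as_of` itself would leak same-day information into a feature that must be computable before the label event occurs.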
💡 Key Takeaways
Point-in-time correctness requires features computed as of time t to use only data available strictly before t, preventing label leakage that would inflate training accuracy but fail in production
Airbnb Zipline achieves single-digit-millisecond p50 latency (2 to 10 ms) for online lookups while backfilling 30 to 365 days of history offline in hours to days
Idempotency through deterministic upserts keyed by entity id, feature name, timestamp, and monotonic version ensures reruns converge to identical results without duplicates
Uber Michelangelo reprocesses hundreds of terabytes per day using copy-on-write or merge-on-read patterns with partition-level checkpointing for restartability
The same feature definition executes offline for backfill and online for serving to eliminate training-serving skew that causes model accuracy drops of 20% or more in production
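The idempotent-upsert pattern from the takeaways can be sketched with an in-memory store. The dict-backed store, function name, and row schema are illustrative assumptions; a production system would use an upsert/merge against a feature table keyed the same way.

```python
def upsert_features(store: dict, rows: list[dict]) -> None:
    """Idempotent upsert: key each row by (entity_id, feature_name,
    event_ts, version) so rerunning the same backfill overwrites rows
    in place instead of appending duplicates."""
    for row in rows:
        key = (row["entity_id"], row["feature_name"],
               row["event_ts"], row["version"])
        store[key] = row["value"]

store = {}
batch = [{"entity_id": "u1", "feature_name": "trips_30d",
          "event_ts": "2024-01-20", "version": 1, "value": 12}]
upsert_features(store, batch)
upsert_features(store, batch)  # rerun after a failure: converges, no duplicates
assert len(store) == 1
```

Because the key is deterministic and the write is a pure overwrite, a restarted job can safely replay any partition it already processed, which is what makes partition-level checkpointing safe.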
📌 Examples
Netflix reprocesses hundreds of billions of events per day to build training datasets after logic changes, completing full-day reprocesses within overnight maintenance windows at multi-terabyte-per-hour throughput
Meta's feature store runs billions of feature computations daily with single-digit-millisecond median online reads and nightly training dataset generation (10^9 to 10^11 rows) completing within 12 hours
A ride-sharing platform backfills a user's 30-day trip-count feature: for a training row at January 20th, it counts trips from December 21 through January 19 inclusive, strictly excluding January 20 and later to avoid leakage
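The window arithmetic in the ride-sharing example can be made explicit. The function name is a hypothetical helper for illustration.

```python
from datetime import date, timedelta

def count_window(label_date: date, days: int = 30) -> tuple[date, date]:
    """Inclusive window [label_date - days, label_date - 1] for an
    N-day count feature on a training row at `label_date`; the label
    date itself is strictly excluded to prevent leakage."""
    return label_date - timedelta(days=days), label_date - timedelta(days=1)

start, end = count_window(date(2024, 1, 20))
# start = 2023-12-21, end = 2024-01-19: exactly 30 days, Jan 20 excluded
```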