
Backfill Cost and Throughput Planning

Production backfills consume significant compute and storage resources, so they require careful capacity planning and cost budgeting. A typical baseline is 5 to 20 terabytes per hour of scan throughput on a 100-worker batch cluster when reading columnar formats such as Parquet or ORC with predicate pushdown and partition pruning. A 10-hour backfill on such a cluster translates to roughly 2,000 virtual CPU (vCPU) hours (about 2 vCPUs per worker). At cloud pricing of $1 to $2 per vCPU-hour, that backfill costs approximately $2,000 to $4,000 in compute alone, not counting storage read costs or network egress.

Teams typically cap per-feature backfill budgets at 24 hours of wall-clock time and $5,000, forcing prioritization by model impact: a feature that provides less than a 1% accuracy lift may not justify a full historical recompute.

The dominant cost driver is data volume multiplied by window length. A 180-day unique-user count requires scanning 180 days of raw events for every training row, exploding both I/O and state size compared to a 7-day window. Uber reports ingesting hundreds of terabytes daily, making full reprocessing prohibitively expensive without incremental strategies. Netflix runs large backfills on separate compute pools during off-peak windows to avoid starving production streaming jobs.

Partition planning also has a significant impact on runtime. Listing existing date partitions once at job start and filtering for the target range (the bottom-up approach) outperforms issuing many small per-partition existence checks (top-down). The overhead comes from per-request initialization against cloud object storage; minimizing the request count can reduce wall-clock time by 30% to 50% for jobs with thousands of partitions.
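To make the arithmetic concrete, here is a minimal back-of-the-envelope estimator; the function name and parameter set are hypothetical, and the inputs simply restate the baseline figures above (100 TB scanned at 10 TB/hour on a 100-worker cluster with 2 vCPUs per worker, at $1 per vCPU-hour).

```python
# Back-of-the-envelope backfill cost estimator (illustrative values only).
def estimate_backfill_cost(
    scan_tb: float,               # total terabytes the job must scan
    throughput_tb_per_hr: float,  # cluster scan throughput, e.g. 5-20 TB/hr
    workers: int,                 # number of workers in the batch cluster
    vcpus_per_worker: int,        # vCPUs per worker
    price_per_vcpu_hr: float,     # cloud price, e.g. $1-$2 per vCPU-hour
) -> dict:
    hours = scan_tb / throughput_tb_per_hr
    vcpu_hours = hours * workers * vcpus_per_worker
    return {
        "wall_clock_hours": round(hours, 1),
        "vcpu_hours": round(vcpu_hours),
        "compute_cost_usd": round(vcpu_hours * price_per_vcpu_hr),
    }

print(estimate_backfill_cost(100, 10, 100, 2, 1.0))
# -> {'wall_clock_hours': 10.0, 'vcpu_hours': 2000, 'compute_cost_usd': 2000}
```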
💡 Key Takeaways
Baseline throughput of 5 to 20 terabytes per hour on a 100-worker cluster translates to a 10-hour backfill costing $2,000 to $4,000 at $1 to $2 per vCPU-hour
Window length dominates cost: 180-day unique-user aggregates scan roughly 25 times more data than 7-day windows, often making full-history recomputes prohibitively expensive
Teams cap per-feature backfills at 24 hours of wall-clock time and a $5,000 budget, prioritizing features that deliver at least a 1% to 2% model accuracy lift
Bottom-up partition listing (list once, then filter) reduces wall-clock time by 30% to 50% versus top-down existence checks by minimizing cloud storage API calls (see the sketch after this list)
Separate compute pools for backfills prevent resource contention with production streaming jobs, but may slow completion if capacity is capped
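The bottom-up listing pattern can be sketched with a generic object-store client. The bucket name, the dt=YYYY-MM-DD/ prefix layout, and the use of boto3 here are assumptions for illustration, not a prescribed implementation.

```python
# Bottom-up partition discovery: one paginated listing of dt= prefixes,
# filtered in memory, instead of one existence check per candidate date.
from datetime import date
import boto3

s3 = boto3.client("s3")

def list_partitions_bottom_up(bucket: str, root: str, start: date, end: date) -> list[str]:
    paginator = s3.get_paginator("list_objects_v2")
    found = []
    # A handful of paginated list calls covers thousands of partitions.
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{root}/dt=", Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            # e.g. "events/raw/dt=2024-03-01/" -> date(2024, 3, 1)
            dt = date.fromisoformat(cp["Prefix"].split("dt=")[1].rstrip("/"))
            if start <= dt <= end:
                found.append(cp["Prefix"])
    return sorted(found)

# The top-down alternative issues one list/head request per candidate date,
# and that per-request overhead is what drives the 30% to 50% difference.
```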
📌 Examples
Uber ingests hundreds of terabytes daily; a full 365-day backfill over this volume would cost hundreds of thousands of dollars, forcing incremental reprocessing strategies with state checkpointing (see the sketch after these examples)
A recommendation-model feature providing a 0.5% AUC lift is not backfilled over 12 months (estimated cost: $8,000); instead, the team waits 60 days for the data to accrue naturally
Netflix schedules large backfills during overnight off-peak windows on isolated batch clusters, achieving multi-terabyte-per-hour throughput without impacting real-time recommendation serving
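A minimal sketch of the incremental idea, assuming daily event partitions and a local checkpoint directory; the paths and helper names are hypothetical, and a production system would typically checkpoint mergeable sketches (e.g., HyperLogLog) rather than raw ID sets to bound state size.

```python
# Incremental reprocessing with state checkpointing: compute a per-day
# partial (the set of user IDs seen that day) once, persist it, and answer
# any N-day unique-user count by merging cached partials instead of
# rescanning N days of raw events for every run.
import pickle
from datetime import date, timedelta
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints/daily_user_sets")

def daily_user_set(day: date, raw_events) -> set[str]:
    """Load the day's checkpoint if present; otherwise scan raw events once."""
    path = CHECKPOINT_DIR / f"{day.isoformat()}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    users = {e["user_id"] for e in raw_events if e["dt"] == day}  # one scan per day, ever
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(users))
    return users

def unique_users(end_day: date, window_days: int, raw_events) -> int:
    """Merge cached daily partials; only uncheckpointed days touch raw data."""
    merged: set[str] = set()
    for i in range(window_days):
        merged |= daily_user_set(end_day - timedelta(days=i), raw_events)
    return len(merged)
```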