
Backfill Cost and Throughput Planning

Compute Cost Reality

Production backfills consume significant compute and storage resources, requiring careful capacity planning and cost budgeting. A typical baseline is 5 to 20 terabytes per hour of throughput on a 100-worker batch cluster when scanning columnar formats like Parquet or ORC with predicate pushdown and partition pruning. This translates to roughly 2,000 vCPU-hours per TB of raw data processed.
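These baseline figures can be folded into a quick sizing helper. The sketch below is illustrative only: the function name and its defaults (mid-range throughput, a $0.05 spot vCPU-hour price) are assumptions for the example, not figures from any specific platform.

```python
# Rough backfill sizing sketch based on the baseline figures above.
# All parameter defaults are illustrative assumptions, not vendor numbers.

def estimate_backfill(data_tb: float,
                      throughput_tb_per_hr: float = 10.0,   # mid-range of 5-20 TB/hr
                      vcpu_hours_per_tb: float = 2000.0,    # rough compute intensity
                      usd_per_vcpu_hour: float = 0.05       # assumed spot price
                      ) -> dict:
    """Return estimated wall-clock hours, vCPU-hours, and compute cost."""
    hours = data_tb / throughput_tb_per_hr
    vcpu_hours = data_tb * vcpu_hours_per_tb
    return {
        "wall_clock_hours": round(hours, 1),
        "vcpu_hours": vcpu_hours,
        "compute_cost_usd": round(vcpu_hours * usd_per_vcpu_hour, 2),
    }

# Example: a 10 TB scan at the midpoint throughput
print(estimate_backfill(10))
```

At these assumed rates, a 10 TB scan lands around $1,000 of compute, consistent with the spot-instance range quoted below.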

Cost Estimation Framework

For a 90-day backfill across 100 million entities with 50 features: raw data volume is approximately 10 TB, compute cost is $500 to $2,000 on spot instances, and storage cost is $100 to $500 per month for the output feature table. A total initial backfill investment of $1,000 to $3,000 is typical, with ongoing storage costs thereafter.
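A minimal total-cost sketch for scenarios like this, assuming an object-store rate of roughly $0.02 per GB-month (an assumption for illustration, not a quoted price):

```python
# Back-of-envelope total cost: initial compute plus cumulative storage.
# The storage rate is an assumed, typical object-store price.

def backfill_tco(compute_cost_usd: float,
                 output_tb: float,
                 months: int,
                 usd_per_gb_month: float = 0.02) -> float:
    """Initial compute cost plus storage cost accumulated over `months`."""
    storage_monthly = output_tb * 1024 * usd_per_gb_month
    return compute_cost_usd + storage_monthly * months

# A 10 TB output table kept for 12 months after a $1,500 compute run
print(round(backfill_tco(1500, 10, 12), 2))  # prints 3957.6
```

The point of the exercise: after a year, storage can rival the original compute spend, so retention policy belongs in the budget.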

Optimization Levers

Partition pruning skips irrelevant date partitions, reducing data scanned by 10 to 100x for targeted backfills. Predicate pushdown filters rows at the storage layer before reading into memory. Incremental backfills recompute only changed entities rather than the full population. Spot instances reduce compute cost by 60 to 80 percent, but interruptions stretch completion times by 2 to 3x.
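The first and third levers can be sketched in a few lines; the helper names here are hypothetical stand-ins for the filtering a query engine would do before any data is read.

```python
from datetime import date, timedelta

def prune_partitions(all_partitions: list[date], start: date, end: date) -> list[date]:
    """Partition pruning: keep only partitions inside the backfill window."""
    return [p for p in all_partitions if start <= p <= end]

def incremental_entities(all_entities: set[str], changed: set[str]) -> set[str]:
    """Incremental backfill: recompute only entities whose upstream data changed."""
    return all_entities & changed

# One year of daily partitions, but the backfill only needs one week
partitions = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]
window = prune_partitions(partitions, date(2024, 6, 1), date(2024, 6, 7))
print(len(window))  # prints 7 -- only 7 of 365 partitions are scanned
```

Here a one-week target window cuts the scan from 365 partitions to 7, the same ~50x reduction the 10-to-100x range describes.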

Priority and SLA Planning

Classify backfills by urgency: critical (model launch blocked, SLA of hours), normal (scheduled retraining, SLA of days), and background (experimental features, SLA of weeks). Allocate dedicated compute quota for critical backfills while running background work on spare capacity.
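One minimal way to encode this triage, with illustrative tier names matching the classification above (the SLA values and scheduling logic are assumptions, not a standard):

```python
from dataclasses import dataclass

# Assumed SLA budgets in hours: critical ~ hours, normal ~ days, background ~ weeks
SLA_HOURS = {"critical": 12, "normal": 72, "background": 336}

@dataclass
class BackfillJob:
    name: str
    priority: str  # "critical" | "normal" | "background"

def schedule(jobs: list[BackfillJob]) -> list[BackfillJob]:
    """Order jobs so critical backfills claim dedicated quota first."""
    order = {"critical": 0, "normal": 1, "background": 2}
    return sorted(jobs, key=lambda j: order[j.priority])

queue = schedule([BackfillJob("experimental_feature", "background"),
                  BackfillJob("launch_blocker", "critical")])
print([j.name for j in queue])  # prints ['launch_blocker', 'experimental_feature']
```

In practice the critical tier would map to a reserved compute pool while background jobs run preemptibly on spare capacity.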

Cost Monitoring

Track cost per feature per backfill to identify expensive features consuming disproportionate resources. A single complex join or aggregation can dominate total backfill cost. Optimize or approximate expensive features when the cost exceeds the value they provide.
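A simple monitor for this pattern might flag any feature consuming an outsized share of a backfill's total cost; the threshold and feature names below are illustrative.

```python
# Flag features whose backfill cost dominates the total, so they can be
# optimized or approximated. The 50% dominance threshold is an assumption.

def expensive_features(cost_by_feature: dict[str, float],
                       dominance_ratio: float = 0.5) -> list[str]:
    """Return features consuming more than `dominance_ratio` of total cost."""
    total = sum(cost_by_feature.values())
    return [f for f, c in cost_by_feature.items() if c / total > dominance_ratio]

costs = {"user_age": 50.0, "complex_join_feature": 900.0, "clicks_7d": 120.0}
print(expensive_features(costs))  # prints ['complex_join_feature']
```

Here the single complex join accounts for roughly 84% of the spend, exactly the kind of feature worth approximating if its model lift is marginal.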

💡 Key Takeaways
Baseline throughput of 5 to 20 terabytes per hour on 100-worker clusters translates to 10-hour backfills costing $2,000 to $4,000 at $1 to $2 per vCPU-hour
Window length dominates cost: 180-day unique-user aggregates scan roughly 25 times more data than 7-day windows, often making full-history recomputes prohibitively expensive
Teams commonly cap per-feature backfills at 24 hours of wall-clock time and a $5,000 budget, prioritizing features that provide at least a 1% to 2% model accuracy lift
Bottom-up partition listing (list once, filter) reduces wall-clock time by 30% to 50% versus top-down existence checks by minimizing cloud storage API calls
Separate compute pools for backfills prevent resource contention with production streaming jobs but may slow completion if capacity is capped
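The bottom-up listing pattern from the takeaways can be sketched as: issue one listing call for the prefix and filter client-side, rather than one existence check per candidate partition. `list_prefix` and `exists` below are hypothetical stand-ins for a cloud storage listing API and a per-object HEAD request.

```python
# Bottom-up partition discovery: one LIST call plus a local filter,
# versus one storage API call per candidate partition (top-down).

def bottom_up(list_prefix, wanted: set[str]) -> set[str]:
    """One LIST call, then filter locally: O(1) API round trips."""
    return set(list_prefix()) & wanted

def top_down(exists, wanted: set[str]) -> set[str]:
    """One existence check per partition: O(n) API round trips."""
    return {p for p in wanted if exists(p)}

existing = {f"dt=2024-06-{d:02d}" for d in range(1, 31)}
wanted = {"dt=2024-06-01", "dt=2024-07-01"}
print(sorted(bottom_up(lambda: existing, wanted)))  # prints ['dt=2024-06-01']
```

Both return the same partitions; the difference is purely in API call count, which is where the 30% to 50% wall-clock saving comes from on high-latency object stores.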
📌 Interview Tips
1. Uber ingests hundreds of terabytes daily; a full 365-day backfill over this volume would cost hundreds of thousands of dollars, forcing incremental reprocessing strategies with state checkpointing
2. A recommendation-model feature providing a 0.5% AUC lift is not backfilled over 12 months (estimated $8,000 cost); instead, the team waits 60 days for the data to accrue naturally
3. Netflix schedules large backfills during overnight off-peak windows on isolated batch clusters, achieving multi-terabyte-per-hour throughput without impacting real-time recommendation serving