Backfill Cost and Throughput Planning
Compute Cost Reality
Production backfills consume significant compute and storage resources and require careful capacity planning and cost budgeting. A typical baseline is 5 to 20 terabytes per hour of scan throughput on a 100-worker batch cluster when reading columnar formats like Parquet or ORC with predicate pushdown and partition pruning. For budgeting purposes, a rough figure is 2,000 vCPU-hours per TB of raw data processed, though actual consumption varies widely with feature complexity.
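The baseline above can be turned into a quick sizing calculation. This is a minimal sketch using the figures from the text (5 to 20 TB/hour scan throughput, 2,000 vCPU-hours per TB); the function name and default values are illustrative planning assumptions, not measurements.

```python
def estimate_backfill(raw_tb: float,
                      throughput_tb_per_hr: float = 10.0,   # mid-range of the 5-20 TB/hr baseline
                      vcpu_hours_per_tb: float = 2000.0):   # budgeting figure from the text
    """Return (wall-clock hours to scan, total vCPU-hours to budget)."""
    scan_hours = raw_tb / throughput_tb_per_hr
    vcpu_hours = raw_tb * vcpu_hours_per_tb
    return scan_hours, vcpu_hours

# Example: a 10 TB backfill scans in ~1 hour but budgets 20,000 vCPU-hours
hours, vcpu = estimate_backfill(10.0)
```

Separating scan time from the vCPU-hour budget makes it explicit that wall-clock duration and total compute spend are different planning axes.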
Cost Estimation Framework
For a 90-day backfill across 100 million entities with 50 features: raw data volume is approximately 10 TB, compute cost is $500 to $2,000 on spot instances, and storage cost is $100 to $500 per month for the output feature table. A total initial backfill investment of $1,000 to $3,000 is typical, with ongoing storage costs.
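As a sketch, the framework above can be encoded directly. The spot price, storage price, and output-table size below are assumptions chosen to land at the low end of the stated ranges; substitute your provider's actual rates.

```python
def estimate_cost(raw_tb: float,
                  vcpu_hours_per_tb: float = 2000.0,
                  spot_price_per_vcpu_hr: float = 0.025,     # assumed spot rate ($/vCPU-hour)
                  storage_price_per_tb_month: float = 25.0,  # assumed object-store rate
                  output_tb: float = 4.0):                   # assumed output feature-table size
    """Return (one-time compute cost, monthly storage cost) in dollars."""
    compute = raw_tb * vcpu_hours_per_tb * spot_price_per_vcpu_hr
    storage_monthly = output_tb * storage_price_per_tb_month
    return compute, storage_monthly

# The 10 TB example: $500 compute, $100/month storage
compute_usd, storage_usd = estimate_cost(10.0)
```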
Optimization Levers
Partition pruning skips irrelevant date partitions, reducing data scanned by 10 to 100x for targeted backfills. Predicate pushdown filters rows at the storage layer before reading into memory. Incremental backfills recompute only changed entities rather than the full population. Spot instances reduce compute cost by 60 to 80 percent with 2 to 3x longer completion times due to interruptions.
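Partition pruning is usually handled by the query engine, but the effect is easy to illustrate. This sketch filters a hypothetical set of daily date partitions down to a backfill window before any data is read; a 7-day targeted backfill over a year of partitions scans roughly 52x less data.

```python
from datetime import date, timedelta

def prune_partitions(all_days, start, end):
    """Keep only date partitions inside the backfill window (partition pruning)."""
    return [d for d in all_days if start <= d <= end]

# A year of daily partitions, but only a 7-day targeted backfill:
days = [date(2024, 1, 1) + timedelta(i) for i in range(365)]
kept = prune_partitions(days, date(2024, 6, 1), date(2024, 6, 7))
# 365 partitions -> 7 partitions scanned, ~52x reduction
```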
Priority and SLA Planning
Classify backfills by urgency: critical (model launch blocked, SLA of hours), normal (scheduled retraining, SLA of days), and background (experimental features, SLA of weeks). Allocate dedicated compute quota for critical backfills while running background work on spare capacity.
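The urgency classes above can be modeled as a small routing table. The class names and SLAs come from the text; the quota assignments and the `classify` decision rule are a hypothetical sketch of how a scheduler might route jobs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackfillClass:
    name: str
    sla: str    # target completion window
    quota: str  # where the work runs

CLASSES = {
    "critical":   BackfillClass("critical",   "hours", "dedicated quota"),
    "normal":     BackfillClass("normal",     "days",  "shared queue"),
    "background": BackfillClass("background", "weeks", "spare capacity"),
}

def classify(model_launch_blocked: bool, experimental: bool) -> BackfillClass:
    """Route a backfill request to a priority class."""
    if model_launch_blocked:
        return CLASSES["critical"]
    return CLASSES["background"] if experimental else CLASSES["normal"]
```

Keeping the class definitions in one table makes the SLA and quota policy auditable in code review rather than scattered across scheduler configs.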
Cost Monitoring
Track cost per feature per backfill to identify expensive features consuming disproportionate resources. A single complex join or aggregation can dominate total backfill cost. Optimize or approximate expensive features when the cost exceeds the value they provide.
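A simple way to operationalize this is to attribute cost per feature and flag outliers by share of total spend. The feature names, costs, and 25% threshold below are illustrative.

```python
def flag_expensive_features(cost_by_feature: dict, share_threshold: float = 0.25):
    """Return features whose share of total backfill cost exceeds the threshold."""
    total = sum(cost_by_feature.values())
    return sorted(f for f, c in cost_by_feature.items() if c / total > share_threshold)

# Hypothetical per-feature costs: the complex join and 90-day aggregation dominate.
costs = {"avg_txn_90d": 620.0, "last_login": 12.0, "geo_join": 480.0, "age": 3.0}
expensive = flag_expensive_features(costs)  # -> ['avg_txn_90d', 'geo_join']
```

Features that are flagged repeatedly across backfills are the candidates for optimization or approximation.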