
Trade-offs: Full vs. Incremental Backfill

The Core Decision: When logic changes, should you reprocess all historical data or just a targeted time window? This is not a theoretical question. It directly impacts cost, time to repair, data consistency, and operational risk.
Full Backfill: all history consistent, but 2 to 7 days of compute and multiple PB moved
vs.
Incremental (90 days): faster and cheaper, but old data keeps the old logic
When to Choose Full Backfill: Full backfill makes sense when you need consistent definitions across your entire history. Machine learning training is the classic case: if you are training a fraud detection model on 2 years of data, having "fraud score" calculated with one algorithm for 2022 to 2023 and a different algorithm for 2024 corrupts your labels, and model accuracy degrades because the target variable is inconsistent. Another scenario is regulatory reporting or audits, where you must provide consistent metrics across arbitrary time ranges. If an auditor asks for revenue trends over 3 years, piecewise definitions create compliance risk.

The cost is significant. Reprocessing 2 years at 10 TB per day means moving 7.3 PB. On a large Spark cluster with 2,000 executors, this takes 3 to 7 days even with 24/7 processing. At cloud pricing of roughly $0.10 per GB processed, that is $730,000 in compute cost.

When to Choose Incremental Backfill: Incremental backfill reprocesses only recent data, typically 30 to 90 days. This is appropriate when older data is rarely queried or when you can tolerate definition changes over time. Business dashboards often fall here: a product manager looking at last quarter's metrics does not care if 2022 data uses a slightly different definition.

The math favors incremental: 90 days at 10 TB per day is 900 TB, completing in 1 to 2 days with 10 to 20 percent of cluster capacity. Cost drops to roughly $90,000. For many organizations, this 8x cost reduction and 3x to 5x faster repair time outweighs having inconsistent historical data.
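This arithmetic is simple enough to sanity-check in a few lines. A minimal sketch, assuming this section's figures (10 TB per day ingested, roughly $0.10 per GB processed); substitute your own volumes and pricing:

```python
# Rough cost model for full vs. incremental backfill.
# Constants are this section's assumed figures, not universal rates.
GB_PER_TB = 1_000
DAILY_VOLUME_TB = 10        # assumed daily data volume
COST_PER_GB_USD = 0.10      # assumed cloud processing price

def backfill_estimate(days: int) -> dict:
    """Estimate data moved and compute cost for reprocessing `days` of history."""
    volume_gb = days * DAILY_VOLUME_TB * GB_PER_TB
    return {
        "days_of_history": days,
        "volume_tb": volume_gb / GB_PER_TB,
        "cost_usd": volume_gb * COST_PER_GB_USD,
    }

full = backfill_estimate(2 * 365)    # 7,300 TB (7.3 PB), ~$730,000
incremental = backfill_estimate(90)  # 900 TB, ~$90,000
print(full)
print(incremental)
print(f"{full['cost_usd'] / incremental['cost_usd']:.1f}x cost difference")  # ~8.1x
```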
Full Backfill Cost Comparison: 2 years ≈ 7.3 PB of data moved vs. 90 days ≈ 900 TB
Hybrid Approach: Many large companies use a hybrid strategy. They fully reprocess the last 90 to 180 days for active use cases, but leave older data untouched unless specifically needed. When someone requests a 3 year analysis, they either accept that pre-2024 data uses old logic, or they trigger a targeted backfill for only the needed metrics and time ranges. This pragmatic approach balances cost and consistency: you get fast repair for recent data where most queries land (90 percent of analytics queries touch data less than 6 months old), while avoiding the expense of reprocessing rarely accessed archives.

Decision Framework: Ask these questions (a code sketch follows the list):
1. What is the read pattern? If 95 percent of queries hit the last 90 days, incremental wins.
2. Is this a training dataset for ML? If yes, choose full backfill for consistency.
3. What is your error tolerance? Revenue and compliance metrics need full backfill; engagement metrics can often use incremental.
4. What is the urgency? If you need repairs in production tomorrow, incremental gets you there; full backfill is for planned migrations.
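One way to make the framework concrete is a small decision helper. The function name, parameters, and thresholds below are illustrative assumptions for this sketch, not a standard API:

```python
# Illustrative encoding of the decision framework above.
# All parameter names and thresholds are assumptions for the sketch.

def choose_backfill_strategy(
    pct_queries_last_90_days: float,
    is_ml_training_data: bool,
    is_revenue_or_compliance: bool,
    repair_needed_now: bool,
) -> str:
    if is_ml_training_data or is_revenue_or_compliance:
        # Consistent definitions across all history matter most here.
        return "full"
    if repair_needed_now:
        # Incremental gets a fix into production in 1 to 2 days, not 3 to 7.
        return "incremental"
    if pct_queries_last_90_days >= 0.95:
        # Almost nobody reads old data; don't pay to reprocess it.
        return "incremental"
    # Mixed read pattern: reprocess a recent window, backfill older ranges on demand.
    return "hybrid"

# Example: an engagement dashboard, mostly recent reads, urgent fix needed.
print(choose_backfill_strategy(0.97, False, False, True))  # -> "incremental"
```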
⚠️ Common Pitfall: Teams often default to full backfill in pursuit of perfect consistency, then regret it when the job takes 5 days, costs $500,000, and blocks other critical work. Start with incremental for most cases.
💡 Key Takeaways
Full backfill (2 years, 7.3 PB) costs roughly $730,000 and takes 3 to 7 days but ensures complete consistency for ML training or compliance
Incremental backfill (90 days, 900 TB) costs roughly $90,000 and completes in 1 to 2 days, acceptable for dashboards where old data is rarely queried
Machine learning training requires full backfill to avoid label inconsistency that degrades model accuracy
90 percent of analytics queries touch data less than 6 months old, making incremental backfill sufficient for most business use cases
Hybrid strategy: fully reprocess 90 to 180 days for active queries, leave older archives unless specifically requested
📌 Examples
1. ML scenario: Training a fraud model on 2 years of transactions requires full backfill so `fraud_score` uses a consistent algorithm across all training examples
2. Dashboard scenario: Product engagement metrics for the last quarter can use an incremental 90 day backfill; 2022 data with old logic is acceptable since no one queries it
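For the dashboard scenario, an incremental backfill often amounts to recomputing only the affected partitions. A minimal PySpark sketch, assuming daily `event_date` partitions; the paths, column names, and the engagement aggregation are hypothetical stand-ins for the corrected logic:

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("incremental-backfill")
    # Overwrite only the partitions we rewrite, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

BACKFILL_DAYS = 90
cutoff = date.today() - timedelta(days=BACKFILL_DAYS)

# Read only the 90-day window; partition pruning keeps the scan near 900 TB
# instead of the full 7.3 PB history. Paths and columns are illustrative.
events = (
    spark.read.parquet("s3://warehouse/raw_events/")
    .where(F.col("event_date") >= F.lit(cutoff.isoformat()))
)

# Recompute the metric with the corrected logic (hypothetical aggregation).
engagement = events.groupBy("event_date", "user_id").agg(
    F.count("*").alias("event_count"),
    F.countDistinct("session_id").alias("sessions"),
)

# Dynamic partition overwrite replaces only the recomputed daily partitions;
# partitions older than the cutoff keep the output of the old logic.
(
    engagement.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://warehouse/engagement_daily/")
)
```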