Advanced: Incremental State & Dependency Management
The Dependency Problem:
In real data platforms, tables depend on other tables. Your daily_revenue table joins to user_segments to compute revenue per segment. When you reprocess user_segments, you must also reprocess daily_revenue and everything downstream. This forms a directed acyclic graph (DAG) of dependencies.
Naive backfill approaches reprocess the entire DAG for the affected time range. But this is wasteful. If only one table in a 20-table pipeline changed logic, you do not need to recompute the other 19. The challenge is tracking which tables are affected and propagating reprocessing only where needed.
Incremental State Propagation:
Sophisticated systems use incremental state markers. Each table partition has metadata indicating its version and the versions of its upstream dependencies. When you reprocess user_segments with new logic and write user_segments_v3, the orchestrator marks all downstream partitions as stale.
The next scheduled run of daily_revenue sees that its upstream dependency changed and automatically triggers a reprocessing of its affected partitions. This propagates through the DAG: daily_revenue reprocesses, marks monthly_rollups as stale, and so on.
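To make the mechanism concrete, here is a minimal sketch of staleness propagation over a small dependency graph. The in-memory metadata store, table names, and version fields are illustrative stand-ins for whatever catalog a real orchestrator uses.

```python
# Minimal sketch of staleness propagation. The DOWNSTREAM map and the
# partition_meta store are hypothetical illustrations, not a specific
# orchestrator's API.
from collections import deque

# Downstream dependency graph: table -> tables that consume it
DOWNSTREAM = {
    "user_segments": ["daily_revenue"],
    "daily_revenue": ["monthly_rollups"],
    "monthly_rollups": [],
}

# Partition metadata: (table, date) -> {"version": int, "stale": bool}
partition_meta = {}


def mark_stale_downstream(table: str, dates: list[str]) -> None:
    """Breadth-first walk of the DAG, flagging downstream partitions as stale."""
    queue = deque(DOWNSTREAM.get(table, []))
    while queue:
        downstream_table = queue.popleft()
        for day in dates:
            meta = partition_meta.setdefault(
                (downstream_table, day), {"version": 1, "stale": False}
            )
            meta["stale"] = True  # the next scheduled run sees this and reprocesses
        queue.extend(DOWNSTREAM.get(downstream_table, []))


# After reprocessing user_segments with new logic (v3) for two dates:
mark_stale_downstream("user_segments", ["2024-03-14", "2024-03-15"])
```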
This is how Airbnb and similar companies manage backfill at scale: they do not manually rerun every downstream job. The orchestrator detects staleness and automatically schedules reprocessing, respecting resource limits and priority.
"At 50+ tables in a dependency graph, manual backfill coordination becomes impossible. Automated staleness propagation is the only scalable approach."
Selective Reprocessing by Entity:
Another advanced pattern is entity-scoped backfill. Instead of reprocessing all users for 90 days, you reprocess only affected users. For example, a pricing bug affected only users in the European Union. Raw event logs are partitioned by date and region. You backfill only the region=EU partitions, leaving US and Asia untouched.
This slashes cost and time. EU is 20 percent of your user base, so you process 180 TB instead of 900 TB. But it introduces complexity. Your aggregates now mix old and new logic by region. Queries that span regions must be aware of this. Debugging becomes harder: "Is this metric wrong because of the bug, or because we only fixed EU?"
Entity-scoped backfill works best when entities are isolated, for example separate customer tenants in a Business-to-Business (B2B) SaaS platform. Each tenant's data is independent. Reprocessing tenant 42 has no impact on tenant 57. For consumer products where aggregates span all users, entity-scoped backfill is riskier.
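As a rough illustration of the partition-selection step, the sketch below enumerates only the EU partitions for the affected date range. The storage paths and the job-submission stub are hypothetical.

```python
# Minimal sketch of entity-scoped backfill: reprocess only the affected
# region's partitions. Paths and run_backfill are hypothetical placeholders.
from datetime import date, timedelta


def affected_partitions(start: date, end: date, region: str) -> list[str]:
    """Enumerate only the region-scoped partitions for the affected date range."""
    days = (end - start).days + 1
    return [
        f"s3://warehouse/events/date={start + timedelta(days=n)}/region={region}/"
        for n in range(days)
    ]


def run_backfill(partition_path: str) -> None:
    """Placeholder for submitting the reprocessing job for one partition."""
    print(f"submitting reprocessing job for {partition_path}")


# A pricing bug affected only EU users for 90 days; US and Asia stay untouched.
for path in affected_partitions(date(2024, 1, 1), date(2024, 3, 30), region="EU"):
    run_backfill(path)
```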
Coordinating Batch and Streaming:
A sophisticated production challenge is coordinating backfill with real-time streaming pipelines. Suppose your streaming pipeline consumes Kafka, processes events, and writes to the same tables as your batch backfill. During backfill, you must ensure streaming does not overwrite or conflict with batch outputs.
One pattern is to pause streaming for the affected time range and let backfill handle it. After backfill completes, streaming resumes from a later Kafka offset. This guarantees no conflicts but creates a temporary gap in real-time processing.
Another pattern is priority-based overwrites. Batch backfill writes with a higher priority version tag. When the streaming pipeline tries to write the same partition, it checks the version. If batch already wrote a higher version, streaming skips the write. This allows both to run concurrently without conflicts.
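A minimal sketch of that version check, assuming a simple in-memory catalog of the highest version committed per partition; a real system would keep this in table or partition metadata.

```python
# Minimal sketch of priority-based overwrites. The committed_versions catalog
# and the write path are hypothetical.

# Highest version already written per partition, e.g. by batch backfill
committed_versions: dict[str, int] = {"revenue/date=2024-03-15": 3}


def streaming_write(partition: str, version: int, rows: list[dict]) -> bool:
    """Skip the write if a higher-priority (batch) version already owns the partition."""
    if committed_versions.get(partition, 0) >= version:
        return False  # batch backfill already committed a newer version; skip
    committed_versions[partition] = version
    # ... append rows to the partition here ...
    return True


# Streaming (v2) loses to the batch backfill that already committed v3:
assert streaming_write("revenue/date=2024-03-15", version=2, rows=[]) is False
```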
At Uber scale, streaming processes 500,000 events per second while batch backfills 90 days in parallel. Coordination is done with partition locking: backfill acquires a lock on partitions it is reprocessing. Streaming sees the lock and either skips those partitions or queues events for later replay. Locks are released after batch completes and validation passes.
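A rough sketch of the locking handshake using ZooKeeper via the kazoo client, as in the example at the end of this section; the hosts, lock paths, and identifier are assumptions, and error handling and the Kafka replay queue are omitted.

```python
# Minimal sketch of partition locking for batch/streaming coordination.
# Hosts and paths are hypothetical; a real job would add error handling.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

partition = "revenue/date=2024-03-15"
lock = zk.Lock(f"/backfill-locks/{partition}", identifier="batch-backfill")

# Batch side: hold the lock for the duration of the backfill of this partition.
if lock.acquire(blocking=False):
    try:
        pass  # reprocess the partition, then run validation
    finally:
        lock.release()  # streaming resumes normal writes once the lock is released

# A streaming writer would try the same lock non-blocking and, if it is held,
# queue the event to Kafka for later replay instead of writing.
zk.stop()
```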
✓ In Practice: At Airbnb, dependency-aware backfill cuts reprocessing time by 60 to 70 percent by only recomputing affected downstream tables, not the entire DAG.
When Backfill Becomes Migration:
Sometimes "backfill" is actually a full data migration: moving from one storage system to another, changing partitioning schemes, or fundamentally restructuring data models. For example, migrating from Hive to Iceberg table format while also reprocessing with new logic.
This requires a phased approach. First, dual write new data to both old and new systems. Second, backfill historical data to the new system. Third, validate that old and new produce identical results for overlapping time ranges. Fourth, switch reads to the new system. Finally, sunset the old system after a grace period.
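A minimal sketch of the dual-write and validation phases, with hypothetical writer callables and a per-date aggregate comparison standing in for the real table APIs.

```python
# Minimal sketch of the dual-write and validation phases of a migration.
# The writer callables and per-date totals are hypothetical placeholders.

def dual_write(rows: list[dict], old_writer, new_writer) -> None:
    """Phase 1: every new batch lands in both the old and the new system."""
    old_writer(rows)
    new_writer(rows)


def validate_overlap(old_totals: dict[str, float],
                     new_totals: dict[str, float],
                     tolerance: float = 1e-6) -> list[str]:
    """Phase 3: compare per-date aggregates from both systems; return mismatched dates."""
    return [
        day for day in old_totals
        if day not in new_totals or abs(old_totals[day] - new_totals[day]) > tolerance
    ]


# Only switch reads (phase 4) once the overlapping range validates cleanly:
mismatches = validate_overlap({"2024-03-15": 1200.50}, {"2024-03-15": 1200.50})
assert mismatches == []
```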
At LinkedIn scale, this can take months. Dual writing increases infrastructure cost by 30 to 50 percent during migration. But it allows safe validation and instant rollback if issues arise. The alternative, a risky "big bang" cutover, is rarely acceptable for production data platforms.
💡 Key Takeaways
✓ Dependency-aware backfill tracks staleness: when <code>user_segments</code> reprocesses, the orchestrator automatically marks <code>daily_revenue</code> stale and triggers reprocessing
✓ At Airbnb scale, dependency propagation cuts backfill time by 60 to 70 percent by only recomputing affected tables, not the entire 50+ table DAG
✓ Entity-scoped backfill (only EU users, 180 TB) is 5x cheaper than full backfill (900 TB) but creates mixed logic across regions
✓ Coordinating streaming and batch: Uber uses partition locking so 500,000 events per second of streaming does not conflict with a 90 day batch backfill
✓ Data migration requires dual write (30 to 50 percent cost increase), backfill, validation, cutover, and a grace period before sunsetting the old system
📌 Examples
1. Staleness propagation: An Airflow task checks <code>upstream_version</code> metadata. If <code>user_segments</code> changed from v2 to v3, the <code>daily_revenue</code> task sees the staleness and reruns for affected dates.
2. Partition locking: Batch backfill acquires a ZooKeeper lock on <code>revenue/date=2024-03-15</code>. The streaming job checks the lock before writing; if locked, it queues the event to Kafka for later replay.