
How Backfill & Reprocessing Work

The Foundation: Immutable Raw Storage

The entire strategy depends on one architectural principle: your raw data must be immutable and retained. Companies like LinkedIn and Netflix keep months or years of compressed event logs in object storage (S3, GCS, HDFS). This becomes your source of truth. When you need to backfill or reprocess, you always go back to these raw logs, never to live production databases. Why? Because production databases change: a user record today looks different than it did 6 months ago. Raw event logs, by contrast, capture exactly what happened at that moment in time.

The Backfill Workflow

A typical backfill follows this pattern (a minimal orchestration sketch follows the list):
1. Identify the range: Determine which date partitions need processing, for example 2024-01-01 to 2024-03-31.
2. Fan out tasks: The orchestrator (Airflow, Luigi, or an internal scheduler) creates one job per partition for parallel processing.
3. Read raw data: Each task reads compressed logs from object storage for its date, decompresses, parses, and applies transformations.
4. Write to staging: Results go to temporary storage, not directly to production tables.
5. Atomic swap: After validation, the partition is atomically replaced. Queries see either all old data or all new data, never a mix.
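To make the fan-out concrete, here is a minimal Airflow sketch (Airflow 2.4+ style): one task per date partition, each reading immutable raw logs and writing to staging. The bucket path and staging table name echo the example at the end of this page; the DAG id and the body of process_partition are placeholders, and validation plus the atomic swap would run as downstream steps.

```python
from datetime import date, datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_partition(partition_date: str) -> None:
    """Reprocess one day of raw logs: read from object storage, transform, stage."""
    raw_path = f"s3://events/date={partition_date}/"  # immutable raw logs: the source of truth
    staging_table = "metrics_staging"                 # never write straight to production
    # In a real task: download + decompress the logs, parse them, apply the new
    # transformation logic, and write the results to the staging table.
    print(f"would reprocess {raw_path} into {staging_table}")


with DAG(
    dag_id="metrics_backfill_q1_2024",
    start_date=datetime(2024, 1, 1),
    schedule=None,       # a backfill DAG is triggered manually, not on a schedule
    catchup=False,
) as dag:
    first, last = date(2024, 1, 1), date(2024, 3, 31)
    for offset in range((last - first).days + 1):     # one task per date partition in the range
        day = (first + timedelta(days=offset)).isoformat()
        PythonOperator(
            task_id=f"backfill_{day}",
            python_callable=process_partition,
            op_kwargs={"partition_date": day},
        )
```

Newer Airflow versions can express the same fan-out with dynamic task mapping (expand), which keeps the DAG definition independent of the date range.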
Real Numbers: At Uber scale, a single day's partition might contain 500 GB to 2 TB of raw events. Processing 90 days in parallel across a cluster of 500 workers, each handling 2 partitions per hour, completes in roughly 1 hour of wall-clock time. But you need throttling: running at 100 percent of cluster capacity starves production jobs. Most teams target 20 to 30 percent of cluster resources for backfills, which extends completion to 3 to 5 hours but maintains production Service Level Agreements (SLAs).
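One common way to enforce that cap is a concurrency limit in the orchestrator rather than trusting each job to self-throttle. The sketch below uses an Airflow pool; the pool name, the 125-slot sizing (roughly 25 percent of a 500-worker cluster), and the operator shown are assumptions.

```python
# Sketch: throttle backfill tasks with an Airflow pool so production pipelines
# keep most of the cluster. Pool name and slot count are illustrative only.
#
# The pool is created once, e.g. via the CLI:
#   airflow pools set backfill_pool 125 "capped capacity for backfill tasks"
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="throttled_backfill",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    max_active_tasks=125,            # DAG-level ceiling as a second line of defense
) as dag:
    BashOperator(
        task_id="backfill_2024-01-01",
        bash_command="echo 'reprocess partition 2024-01-01'",
        pool="backfill_pool",        # task waits for a free pool slot instead of
        pool_slots=1,                # competing with production jobs for workers
    )
```

The same idea applies outside Airflow: capped Spark or YARN queues give the backfill its own bounded slice of the cluster.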
⚠️ Common Pitfall: Never backfill by querying live production databases for months of historical data. This creates massive read spikes that can degrade user-facing latency from 50ms to 500ms or worse.
💡 Key Takeaways
Always backfill from immutable raw storage (object storage logs), never from live production databases
Fan out work by partition: 90 days becomes 90 parallel jobs for faster completion
Throttle resource usage to 20 to 30 percent of cluster capacity to avoid impacting production SLAs
Use atomic partition swaps: write to staging, validate, then replace production partitions in one operation (see the sketch after this list)
At Uber scale, processing 500 GB to 2 TB per day across 90 days with 500 workers completes in 3 to 5 hours
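A sketch of that staging-then-swap step using Spark SQL on a Hive-style partitioned table. The table and column names are placeholders, and the atomicity of the final overwrite depends on the table format (snapshot-based formats such as Iceberg or Delta commit the replacement as a single operation); adapt the DDL to your engine.

```python
# Sketch: replace one production partition from staging after validation.
# metrics_prod, metrics_staging, and the column list are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_swap").getOrCreate()
partition_date = "2024-01-01"

# The backfill tasks have already written the reprocessed day to metrics_staging,
# and the validation step (see the examples below) has passed.
spark.sql(f"""
    INSERT OVERWRITE TABLE metrics_prod PARTITION (dt = '{partition_date}')
    SELECT user_id, metric_name, metric_value   -- every column except the partition key
    FROM metrics_staging
    WHERE dt = '{partition_date}'
""")
```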
📌 Examples
1. Airflow DAG with 90 tasks, one per date partition, each reading from the S3 path s3://events/date=2024-01-01/ and writing to the staging table metrics_staging
2. Validation step compares record counts: the old partition has 1.2 billion rows, the new partition has 1.201 billion rows (a 0.08% difference, acceptable; sketched below)
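A sketch of the count check in the second example; the 0.1 percent tolerance and the helper name are assumptions, and real pipelines often layer checksums or per-dimension comparisons on top of raw row counts.

```python
# Sketch: accept a reprocessed partition only if its row count stays within a
# small relative tolerance of the old partition. The threshold is an assumption.
def validate_row_counts(old_count: int, new_count: int,
                        max_rel_diff: float = 0.001) -> bool:
    """True if the new partition's row count is within tolerance of the old one."""
    if old_count == 0:
        return new_count == 0
    return abs(new_count - old_count) / old_count <= max_rel_diff

# Example 2 above: 1.2B rows before, 1.201B after -> ~0.08% difference, accepted.
assert validate_row_counts(1_200_000_000, 1_201_000_000)
```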