
Production Reality: Scale & Validation

The Numbers That Matter: At Netflix scale, a single data pipeline might process 200,000 events per second. Over 24 hours, that is 17.3 billion events. Compressed and stored, this becomes 8 to 12 TB per day. When you need to reprocess 90 days of history, you are moving 720 TB to 1.08 PB of data. Running this naively would consume the entire data platform for days. Instead, production systems implement careful resource management: a backfill might be limited to 500 concurrent Spark executors out of a cluster with 5,000 total, completing in 36 to 48 hours while leaving 90 percent of capacity for daily production workloads.

Versioned Datasets: Large companies handle logic changes with versioned datasets. Instead of overwriting <code>user_engagement</code> in place, you create <code>user_engagement_v3</code> and backfill it completely with the new logic. Both versions coexist during validation: dashboards still read from v2 while data engineers compare distributions, check for anomalies, and validate that changes match expectations. Only after validation passes do you atomically switch queries to read from v3. If something is wrong, rollback is instant: just point back to v2. This pattern trades storage cost (keeping multiple versions) for safety and debuggability.
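The atomic switch itself is often just a view repoint. Here is a minimal sketch in PySpark, assuming the stable name <code>user_engagement</code> is a metastore-backed view and the version tables already exist; the <code>promote</code> helper is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("promote_user_engagement").getOrCreate()

def promote(view: str, version_table: str) -> None:
    # CREATE OR REPLACE VIEW is a single metastore operation: readers see
    # either the old definition or the new one, never a partial state.
    spark.sql(f"CREATE OR REPLACE VIEW {view} AS SELECT * FROM {version_table}")

# After validation passes, repoint the stable name to v3:
promote("user_engagement", "user_engagement_v3")

# Rollback is the same operation pointed back at v2:
# promote("user_engagement", "user_engagement_v2")
```

Because dashboards query the view name rather than a version table, both promotion and rollback are one-statement operations.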
LinkedIn Reprocessing Scale: 10 TB/day ingest volume × 90-day backfill window = 900 TB moved in total.
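What the resource limits described above can look like in practice: a sketch of a capped, day-by-day backfill driver, assuming Spark dynamic allocation; the storage paths and <code>transform_v3</code> stub are hypothetical:

```python
import datetime
from pyspark.sql import SparkSession

# Cap the backfill at 500 executors (~10% of a 5,000-executor cluster),
# leaving the rest of the capacity for daily production workloads.
spark = (
    SparkSession.builder
    .appName("user_engagement_v3_backfill")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "500")
    .getOrCreate()
)

def transform_v3(df):
    # Placeholder for the corrected transformation logic.
    return df

start = datetime.date(2024, 1, 1)  # hypothetical start of the 90-day window
for offset in range(90):
    day = start + datetime.timedelta(days=offset)
    events = spark.read.parquet(f"s3://events/raw/dt={day}")  # hypothetical layout
    transform_v3(events).write.mode("overwrite").parquet(
        f"s3://warehouse/user_engagement_v3/dt={day}"
    )
    print(f"backfilled {day} ({offset + 1}/90)")  # crude progress tracking
```

Driving the backfill one day-partition at a time gives natural progress tracking and lets a failed run resume from the last completed partition instead of starting over.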
Validation is Not Optional: Before promoting a reprocessed dataset, production teams run extensive validation. This includes statistical checks: comparing record counts, key distributions, null rates, and aggregate metrics between old and new versions. For example, if total revenue in the old version is $45.2 million for January and the new version shows $45.8 million, that 1.3 percent difference needs explanation. Is it the bug fix? Or did the reprocessing introduce new errors? Some teams require that 95 percent of dimensions (like country, product category, user segment) differ by less than 0.5 percent, or that any larger differences are documented as expected fixes (see the first sketch below). This prevents silently shipping new bugs during reprocessing.

Handling Schema Evolution: A tricky production challenge is schema changes over time. An event log from 2022 might have different fields than one from 2024. Naive reprocessing code that expects the latest schema will fail or silently drop old events. Robust systems maintain schema registries with versioning: transformation logic checks the <code>schema_version</code> field in each event and applies the appropriate parsing logic. For example, events before June 2023 use <code>user_id</code> as an integer; after that, it is a UUID string. Reprocessing code handles both, mapping old integers to the new UUID format during transformation (see the second sketch below).
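To make the dimension check concrete, here is an illustrative sketch in plain Python with hypothetical numbers: it flags any dimension value (here, revenue by country) whose relative difference between versions reaches 0.5 percent, and reports the pass rate against the 95 percent bar:

```python
def within_tolerance(old: dict, new: dict, tol: float = 0.005):
    """Return (pass_rate, outliers) comparing per-dimension aggregates."""
    outliers = []
    for key, old_val in old.items():
        new_val = new.get(key, 0.0)
        rel_diff = abs(new_val - old_val) / old_val if old_val else float("inf")
        if rel_diff >= tol:
            outliers.append((key, old_val, new_val, rel_diff))
    return 1 - len(outliers) / len(old), outliers

old_rev = {"US": 45.2e6, "DE": 12.1e6, "JP": 8.4e6}  # v2: revenue by country
new_rev = {"US": 45.8e6, "DE": 12.1e6, "JP": 8.4e6}  # v3: after the bug fix

rate, outliers = within_tolerance(old_rev, new_rev)
print(f"{rate:.0%} of dimensions within 0.5% (need >= 95%)")
for key, old_val, new_val, diff in outliers:
    print(f"  explain {key}: {old_val:,.0f} -> {new_val:,.0f} ({diff:.2%})")
```

The US entry reproduces the $45.2M vs $45.8M example: a 1.33 percent difference that fails the tolerance and must be documented as the expected effect of the bug fix before promotion.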
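And a sketch of the version-aware parsing, assuming events carry the <code>schema_version</code> field described above; the field names and the deterministic integer-to-UUID mapping via <code>uuid.uuid5</code> are illustrative assumptions, not a specific company's scheme:

```python
import uuid

def parse_event(raw: dict) -> dict:
    version = raw.get("schema_version", 1)
    if version == 1:
        # Pre-June 2023: user_id was an integer. Map it deterministically
        # into the UUID space so old and new events share one join key.
        user_id = str(uuid.uuid5(uuid.NAMESPACE_OID, str(raw["user_id"])))
    elif version == 2:
        user_id = raw["user_id"]  # already a UUID string
    else:
        raise ValueError(f"unknown schema_version: {version}")
    return {"user_id": user_id, "event_type": raw["event_type"]}

print(parse_event({"schema_version": 1, "user_id": 42, "event_type": "view"}))
print(parse_event({"schema_version": 2,
                   "user_id": "7d444840-9dc0-11d1-b245-5ffdce74fad2",
                   "event_type": "view"}))
```

The mapping must be deterministic: the same legacy integer has to yield the same UUID on every run so pre- and post-migration events join on a single key.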
"At scale, backfill is not a one-off script. It is a first class workflow with resource limits, progress tracking, validation gates, and safe rollback mechanisms."
💡 Key Takeaways
Processing 90 days at 10 TB per day (900 TB total) at Netflix or LinkedIn requires throttling to 10 to 20 percent of cluster capacity to maintain production SLAs
Versioned datasets like <code>user_engagement_v3</code> enable side-by-side comparison and instant rollback before promoting to production
Validation requires checking that 95 percent of dimensions differ by less than 0.5 percent, or documenting larger changes as expected fixes
Schema evolution handling is critical: maintain schema registries and apply version-specific parsing logic per event
At 200,000 events per second, a single day produces 17.3 billion events and 8 to 12 TB of compressed storage
📌 Examples
1. LinkedIn pattern: compute <code>user_engagement_v3</code> for the full 90 days, compare distributions with v2, validate revenue totals within 0.5%, then atomically switch dashboards to read from v3
2. Schema handling: events with <code>schema_version=1</code> (pre-June 2023) use an integer <code>user_id</code>; version 2 uses a UUID. Reprocessing code checks the version field and applies the appropriate mapping.