Failure Modes & Idempotency
The Biggest Failure: Double Counting
Imagine a daily revenue aggregation job that sums payment events into a table partitioned by date. A backfill job runs to fix a currency conversion bug. If the job simply inserts new rows without removing old ones, you have now counted every transaction twice. Revenue for March 15 shows $2.4 million instead of $1.2 million. Your CFO makes decisions on inflated numbers. This is catastrophic.
The fix is idempotency: every backfill and reprocessing job must produce the same output no matter how many times it runs. The most common pattern is complete partition overwrite. The job writes results to a temporary staging location, validates them, then atomically replaces the entire date partition. Running it twice produces identical final state.
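A minimal PySpark sketch of this staging-then-swap pattern, assuming a Hive-style table partitioned by date; the table names, S3 paths, and the single row-count validation are illustrative stand-ins, not the original job:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def backfill_revenue_partition(target_date: str) -> None:
    staging_path = f"s3://warehouse-staging/revenue/date={target_date}"

    # 1. Recompute the aggregate for one date from immutable raw events into staging.
    (spark.read.parquet("s3://raw-payments/events/")
        .where(F.col("date") == target_date)
        .groupBy("date")
        .agg(F.sum("amount_usd").alias("revenue_usd"))
        .write.mode("overwrite")
        .parquet(staging_path))

    # 2. Validate the staging output before touching the production table.
    if spark.read.parquet(staging_path).count() == 0:
        raise ValueError(f"Staging output for {target_date} is empty; aborting swap")

    # 3. Swap the partition to point at the validated output. Rerunning the whole
    #    job converges to the same final state instead of duplicating rows.
    spark.sql(f"ALTER TABLE analytics.revenue DROP IF EXISTS PARTITION (date='{target_date}')")
    spark.sql(
        f"ALTER TABLE analytics.revenue ADD PARTITION (date='{target_date}') "
        f"LOCATION '{staging_path}'"
    )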
Overwhelming Upstream Systems:
A classic failure mode is backfill jobs that query production databases for historical data. Suppose you backfill 6 months of user profile data by querying your PostgreSQL database for every user who was active each day. That is 180 days times 5 million active users per day, or 900 million queries. Even at 10 milliseconds per query, that is 2,500 hours of database load.
In practice, this manifests as production read latency spiking from 50 milliseconds p99 to 500 milliseconds or worse, triggering alerts and degrading user experience. The solution is to always backfill from immutable archives (object storage logs, database snapshots, change data capture streams), never from live production systems.
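As a hedged illustration, a backfill reader pointed at an immutable archive in object storage (the bucket layout, file format, and column names here are assumptions) replaces hundreds of millions of point lookups with one bulk, partition-pruned scan per date range:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def load_profiles_for_backfill(start_date: str, end_date: str):
    # One scan over archived CDC/snapshot files; the production PostgreSQL
    # instance is never queried during the backfill.
    return (spark.read.parquet("s3://archive/user_profile_snapshots/")
            .where(F.col("date").between(start_date, end_date)))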
Schema Evolution Edge Cases:
A subtle failure happens when schemas change over time and reprocessing code is not aware of it. Consider an event stream where user_id was an integer before June 2023 and became a UUID string after. Naive reprocessing that expects UUIDs will either crash on old events or silently drop them.
Another case: a field is added in version 2 of the schema, so it does not exist in events before a certain date. Your transformation code assumes the field is always present and crashes with null pointer errors when processing 2022 data.
Robust systems maintain a schema registry and check the schema_version field in each event. Transformation logic branches: if version 1, parse and map fields one way; if version 2, use different logic. This allows safe reprocessing across schema boundaries.
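A minimal sketch of that version-branching logic, assuming events arrive as plain dicts carrying a schema_version field; the field names mirror the example above but are otherwise illustrative:

def normalize_user_id(event: dict) -> str:
    # Branch on the schema version recorded in the event itself.
    if event.get("schema_version", 1) == 1:
        # Pre-June-2023 events stored user_id as an integer.
        return str(event["user_id"])
    # Version 2+ events already carry a UUID string.
    return event["user_id"]

def optional_field(event: dict, field: str, default=None):
    # Fields introduced in later schema versions are simply absent in older
    # events, so never assume presence when reprocessing historical data.
    return event.get(field, default)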
Streaming and Backfill Collisions:
A tricky operational failure occurs when a streaming pipeline is still writing to a partition while you run a backfill that overwrites it. For example, today is March 20. Your streaming job is continuously writing March 20 data. You kick off a backfill for March 1 to March 20 to fix a bug. The backfill job completes and overwrites the March 20 partition with data from raw logs, but those logs were captured at midnight. The last 12 hours of streaming data is now lost.
The solution is to backfill only fully closed partitions. Most systems define a partition as closed once it is older than the maximum event lateness. For example, if events can arrive up to 24 hours late, only backfill partitions older than yesterday. Alternatively, pause or reroute streaming writes during backfill, then replay from Kafka offsets to ensure no data loss.
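A small sketch of the closed-partition guard, using the 24-hour lateness bound from the example above; the daily-partition layout is an assumption:

from datetime import date, datetime, time, timedelta, timezone
from typing import Iterator, Optional

MAX_EVENT_LATENESS = timedelta(hours=24)  # example bound from the text

def is_closed(partition_date: date, now: Optional[datetime] = None) -> bool:
    # A daily partition is closed once its end of day plus the lateness bound has passed.
    now = now or datetime.now(timezone.utc)
    partition_end = datetime.combine(
        partition_date + timedelta(days=1), time.min, tzinfo=timezone.utc
    )
    return now >= partition_end + MAX_EVENT_LATENESS

def partitions_to_backfill(start: date, end: date) -> Iterator[date]:
    # Yield only partitions that can no longer receive late or streaming writes.
    current = start
    while current <= end:
        if is_closed(current):
            yield current
        current += timedelta(days=1)

In the March 20 scenario above, this guard would skip both March 19 and March 20, leaving the still-open partitions to the streaming job.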
❗ Remember: Idempotency is non-negotiable. Every backfill job must use complete partition overwrites or deterministic keys to prevent double counting.
Side Effects Cannot Be Replayed:
Some pipelines trigger side effects: sending emails, charging credit cards, firing webhooks to third parties. You cannot simply rerun these. If your pipeline sends 100,000 welcome emails and you reprocess a week of data, you risk sending duplicate emails or issuing duplicate charges.
The pattern is to separate data computation from side effect execution. Reprocessing computes the correct state: which users should have received emails. A separate idempotent reconciliation layer compares computed state to actual external state and emits only the delta. If user 12345 should have received an email but did not, send it. If they already received it, do nothing. This requires maintaining a record of executed side effects with unique idempotency keys; a sketch of this reconciliation loop follows after the examples below.
💡 Key Takeaways
✓ Double counting happens when backfill jobs insert instead of replace: revenue shows $2.4M instead of $1.2M, corrupting all downstream analysis
✓ Backfilling from live production databases can spike read latency from 50ms p99 to 500ms, degrading user experience
✓ Schema evolution requires checking the <code>schema_version</code> field: pre-June-2023 events use an integer <code>user_id</code>, post-June events use a UUID string
✓ Streaming and backfill collisions: only backfill fully closed partitions (older than max event lateness, typically 24 to 48 hours)
✓ Side effects like emails or charges cannot be replayed: use a separate idempotent reconciliation layer that emits only deltas
📌 Examples
1. Idempotent pattern: Spark job writes to <code>revenue_staging/date=2024-03-15</code>, validates, then <code>ALTER TABLE revenue DROP PARTITION (date='2024-03-15'); ALTER TABLE revenue ADD PARTITION (date='2024-03-15') LOCATION 's3://staging/...'</code>
2. Schema handling: <code>user_id = str(event.user_id_int) if event.schema_version == 1 else event.user_id_uuid</code>
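Finally, a hedged sketch of the side-effect reconciliation loop described earlier; the sent-log set and the send_welcome_email callable are hypothetical stand-ins for a durable idempotency-key store and a real email sender:

def reconcile_welcome_emails(computed_recipients, sent_log, send_welcome_email):
    # computed_recipients: user IDs that, per the recomputed state, should have
    #   received a welcome email.
    # sent_log: set of idempotency keys for side effects already executed
    #   (in production this would be a durable table, not an in-memory set).
    # send_welcome_email: callable that performs the actual side effect once.
    for user_id in computed_recipients:
        idempotency_key = f"welcome_email:{user_id}"
        if idempotency_key in sent_log:
            continue  # already executed; reprocessing must not repeat it
        send_welcome_email(user_id)    # emit only the delta
        sent_log.add(idempotency_key)  # record so future runs skip this user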