ETL/ELT Patterns • Data Deduplication Strategies
Failure Modes and Edge Cases in Deduplication
Unstable Keys:
The most common failure is using keys that are not truly unique or stable. Using email as a user identifier without normalizing case and whitespace leads to under-deduplication:
Alice@Example.com and alice@example.com are treated as different users. Using timestamps rounded to the nearest second can cause over-deduplication: two legitimate events in the same second are merged.
Another variant is auto-generated surrogate keys like database sequence IDs. If two services generate keys independently, the same logical record can end up with two different IDs, bypassing deduplication entirely. Always use business-natural keys like order_uuid or external_user_id.
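A minimal sketch of key normalization plus a first-wins dedup pass, assuming simple in-memory records; the normalize_email helper, the field names, and the fallback-to-email rule are illustrative rather than taken from any particular pipeline.

```python
def normalize_email(raw: str) -> str:
    """Lowercase and strip whitespace so 'Alice@Example.com ' and
    'alice@example.com' map to the same dedup key."""
    return raw.strip().lower()


def dedup_key(record: dict) -> tuple:
    # Prefer a business-natural key when one exists; fall back to a
    # normalized email only if the record has no stable identifier.
    if record.get("order_uuid"):
        return ("order_uuid", record["order_uuid"])
    return ("email", normalize_email(record["email"]))


def dedupe(records: list[dict]) -> list[dict]:
    seen: dict[tuple, dict] = {}
    for rec in records:
        seen.setdefault(dedup_key(rec), rec)  # keep the first occurrence
    return list(seen.values())


if __name__ == "__main__":
    rows = [
        {"email": "Alice@Example.com ", "name": "Alice"},
        {"email": "alice@example.com", "name": "Alice"},
    ]
    print(dedupe(rows))  # one record instead of two
```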
Race Conditions in Distributed Systems:
Two microservices processing the same entity concurrently can each perform an upsert based on stale state, creating duplicates that violate uniqueness. Without true transactional semantics across shards, different partitions may each accept a copy.
For example, a user updates their profile on two devices simultaneously. Both API calls hit different backend servers. Each reads the current profile, applies changes, and writes back. The final state depends on race timing, and some systems end up with two profile records instead of one.
❗ Remember: Without database-level unique constraints or distributed transactions, application-layer dedup is best-effort only. Race conditions will create duplicates during concurrent writes.
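One way to push uniqueness down to the database is an upsert against a unique or primary-key constraint. The sketch below uses SQLite purely as a stand-in for whatever store backs the service; the table, columns, and last-write-wins merge rule are assumptions for illustration, not a prescribed design.

```python
import sqlite3

# SQLite stands in for the service's real datastore (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE profiles (
        user_id      TEXT PRIMARY KEY,  -- uniqueness enforced by the database
        display_name TEXT,
        updated_at   TEXT
    )
    """
)


def upsert_profile(user_id: str, display_name: str, updated_at: str) -> None:
    # The unique constraint forces concurrent writers to converge on one row;
    # the WHERE clause keeps the newest update (last-write-wins is just one
    # possible merge rule).
    conn.execute(
        """
        INSERT INTO profiles (user_id, display_name, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT(user_id) DO UPDATE SET
            display_name = excluded.display_name,
            updated_at   = excluded.updated_at
        WHERE excluded.updated_at > profiles.updated_at
        """,
        (user_id, display_name, updated_at),
    )
    conn.commit()


# Two "devices" racing on the same profile: both writes land on one row.
upsert_profile("u-42", "Alice (phone)", "2024-01-01T10:00:00Z")
upsert_profile("u-42", "Alice (laptop)", "2024-01-01T10:00:01Z")
print(conn.execute("SELECT COUNT(*) FROM profiles").fetchone())  # (1,)
```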
Late Arrivals:
Mobile apps queue events offline and replay hours or days later. Clock skew between servers can cause events to arrive out of order. If your streaming dedup window is 6 hours but an event arrives 12 hours late, it bypasses dedup and is treated as new.
This creates a subtle bug. The streaming layer accepts the late event. The nightly batch job sees both the original and the late duplicate. But because the batch job runs with a partition filter like event_date = today, it only scans today's partition and misses the original from yesterday. The duplicate survives.
The fix is a lookback window in batch dedup: scan not just today's partition but the last 3 to 7 days to catch cross-partition duplicates.
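A rough sketch of how that lookback might look in a batch dedup job; the events table, event_id key, ingested_at tie-breaker, and the 7-day default are hypothetical names chosen for the example.

```python
from datetime import date, timedelta


def lookback_partition_filter(run_date: date, lookback_days: int = 7) -> str:
    """Widen the partition predicate so the dedup job can see originals
    that landed in earlier partitions than their late-arriving duplicates."""
    start = run_date - timedelta(days=lookback_days)
    return (
        f"event_date BETWEEN '{start.isoformat()}' "
        f"AND '{run_date.isoformat()}'"
    )


# Hypothetical dedup query over the widened window: event_id is the
# business-natural key, and ingested_at breaks ties in favor of the
# earliest copy of each event.
query = f"""
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY event_id
               ORDER BY ingested_at
           ) AS rn
    FROM events
    WHERE {lookback_partition_filter(date(2024, 1, 15))}
)
WHERE rn = 1
"""
print(query)
```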
CDC Misconfiguration:
Change Data Capture systems emit database changes as events. If updates are emitted as insert events without operation-type metadata, downstream systems treat every update as a new entity. A user profile that changes 10 times appears as 10 separate users.
During backfills, CDC can replay the entire history. If consumers do not deduplicate on primary key plus operation type, they double-count everything. A table with 100 million rows becomes 200 million after a backfill.
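A small sketch of replay-safe CDC application: events are folded into state keyed by primary key, with the operation type deciding whether to upsert or delete. The event shape (op, pk, row) is an assumption for the example, not any specific CDC tool's wire format.

```python
from typing import Iterable


def apply_cdc(events: Iterable[dict]) -> dict:
    """Fold CDC events into current state, keyed on the primary key.

    Treating every change as a bare insert turns 10 updates to one row
    into 10 rows; keying on the primary key and honoring the operation
    type keeps replays and backfills idempotent."""
    state: dict = {}
    for ev in events:  # assumed event shape: {"op": ..., "pk": ..., "row": ...}
        op, pk = ev["op"], ev["pk"]
        if op in ("insert", "update"):
            state[pk] = ev["row"]  # upsert: the latest row image wins
        elif op == "delete":
            state.pop(pk, None)
    return state


# Replaying the full history (a backfill) lands on the same state
# instead of doubling the row count.
events = [
    {"op": "insert", "pk": 1, "row": {"name": "Alice"}},
    {"op": "update", "pk": 1, "row": {"name": "Alice B."}},
]
print(apply_cdc(events + events))  # {1: {'name': 'Alice B.'}}
```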
Partial Failures:
A batch dedup job deletes duplicates in a partition, then crashes before reloading the correct canonical versions. The partition now has zero records instead of deduplicated records. This is silent data loss.
Robust systems use atomic swap patterns. Write deduplicated results to a new table or partition. Validate row counts and metrics. Only after validation, atomically swap the new partition into production. If the job crashes, the old partition remains intact.
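A compact sketch of the atomic swap pattern, again leaning on SQLite only to keep the example runnable; real warehouses would use their own partition-exchange or table-rename primitives, and the validation rule here (staged count must match distinct keys) is just one plausible check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT, payload TEXT);
    INSERT INTO events VALUES ('a', 'x'), ('a', 'x'), ('b', 'y');
    """
)


def dedupe_with_atomic_swap(conn: sqlite3.Connection) -> None:
    cur = conn.cursor()
    # 1. Write deduplicated results to a staging table; production untouched.
    cur.executescript(
        """
        DROP TABLE IF EXISTS events_staging;
        CREATE TABLE events_staging AS
            SELECT event_id, MIN(payload) AS payload
            FROM events
            GROUP BY event_id;
        """
    )
    # 2. Validate before touching production. A crash anywhere up to here
    #    leaves the original table fully intact.
    (staged,) = cur.execute("SELECT COUNT(*) FROM events_staging").fetchone()
    (expected,) = cur.execute(
        "SELECT COUNT(DISTINCT event_id) FROM events"
    ).fetchone()
    if staged == 0 or staged != expected:
        raise RuntimeError("validation failed; keeping the old table")
    # 3. Atomic swap: rename staging into place inside one transaction.
    cur.executescript(
        """
        BEGIN;
        ALTER TABLE events RENAME TO events_old;
        ALTER TABLE events_staging RENAME TO events;
        DROP TABLE events_old;
        COMMIT;
        """
    )


dedupe_with_atomic_swap(conn)
print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (2,)
```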
Batch Job Failure Timeline: NORMAL (100M rows) → DELETE DUPES (80M rows) → CRASH (0 rows)
"The hardest bugs in deduplication are not algorithmic. They are operational: race conditions, partial failures, and late arrivals that your window assumptions do not account for."
💡 Key Takeaways
✓ Unstable keys like non-normalized email or auto-generated IDs cause under-deduplication. Always use business-natural keys like <code>order_uuid</code>.
✓ Race conditions in distributed writes create duplicates even with application-layer dedup. Require database unique constraints or distributed transactions.
✓ Late arrivals beyond the streaming window bypass dedup. Batch jobs need 3 to 7 day lookback windows to catch cross-partition duplicates.
✓ CDC misconfiguration emitting updates as inserts causes double counting. A 100 million row table becomes 200 million after a backfill without proper dedup.
✓ Partial batch job failures can silently delete data. Use atomic swap patterns: write to a new table, validate, then swap into production.
📌 Examples
1. A dedup key uses <code>email + timestamp</code> rounded to seconds. Two legitimate signups in the same second merge into one user, losing data.
2. Two API servers concurrently update the same user profile. Both read stale state and write back. The database ends up with two profile records for one user.
3. A mobile event arrives 18 hours late. The streaming window is 12 hours, so it passes through. The batch job scans only today's partition, missing yesterday's original. Both survive.
4. A CDC system replays 500 million rows during a backfill, emitting updates as inserts. Without <code>operation_type</code> filtering, the warehouse doubles to 1 billion rows.
5. A batch job deletes duplicates in a 100 million row partition and crashes. The partition is now empty. Without atomic swaps, 100 million rows are lost.