Data Deduplication Strategies
How Deduplication Works in Practice
The Layered Defense:
Production systems use deduplication at three layers: the ingestion edge, the streaming pipeline, and the batch warehouse. Each layer makes a different trade-off between latency and completeness.
1. Edge Layer: The application generates a globally unique event_id for every action. Ingestion services treat requests as idempotent on this key and reject replays within a configurable window, typically 24 hours.
2. Streaming Layer: A stateful processor keeps a window of recently seen event IDs, usually 1 to 24 hours. It drops obvious duplicates while maintaining p99 latency under 200 milliseconds. This catches most duplicates but misses late-arriving events.
3. Batch Layer: Nightly jobs scan partitions, group by business keys such as order_id and user_id plus timestamp, and apply deterministic tie-breaking, for example keeping the latest updated_at or the row with a valid payment status.
Streaming dedup performance: p99 latency under 200 ms, with roughly 95 percent of duplicates caught in the stream.
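A minimal sketch of the streaming-layer window, assuming a single consumer process and a plain in-memory dict standing in for a production state backend such as RocksDB or Flink keyed state; the class name, event shape, and 12-hour window are illustrative, not from the original.

```python
import time


class WindowedDeduper:
    """Drops events whose event_id has already been seen inside the window."""

    def __init__(self, window_seconds=24 * 3600):
        self.window_seconds = window_seconds
        self.seen = {}  # event_id -> first-seen unix timestamp

    def accept(self, event_id, now=None):
        """Return True if the event is new and should be forwarded downstream."""
        now = time.time() if now is None else now
        first_seen = self.seen.get(event_id)
        if first_seen is not None and now - first_seen < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self.seen[event_id] = now
        return True

    def evict_expired(self, now=None):
        """Called periodically: forget keys older than the window to bound memory."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}


deduper = WindowedDeduper(window_seconds=12 * 3600)  # 12-hour window
events = [{"event_id": "a1"}, {"event_id": "a1"}, {"event_id": "b2"}]
fresh = [e for e in events if deduper.accept(e["event_id"])]
print([e["event_id"] for e in fresh])  # ['a1', 'b2'] -- the replay of a1 was dropped
```

The same check, keyed on event_id with a TTL, is essentially what the edge layer runs too; the streaming layer just holds it as partitioned state inside the pipeline so the window survives restarts.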
The Scale Challenge:
At a peak of 500,000 events per second (roughly 40 billion events per day), even a 1 percent duplication spike during an incident pollutes 400 million records. A streaming dedup window that stores event IDs for 24 hours must hold billions of keys in memory per partition. At 100 bytes per key, that is hundreds of gigabytes of state.
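A quick back-of-the-envelope check of that state size, using the figures above; the 32-partition count is an assumption added purely for illustration.

```python
events_per_day = 40_000_000_000  # ~500,000 events/s at peak, as stated above
bytes_per_key = 100
partitions = 32                  # assumed partition count, purely for illustration

total_tb = events_per_day * bytes_per_key / 1e12
per_partition_gb = events_per_day / partitions * bytes_per_key / 1e9
print(f"24h of keys: ~{total_tb:.0f} TB total, ~{per_partition_gb:.0f} GB per partition")
```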
Business Logic Matters:
Technical dedup is not enough; you also need business rules. For payment events, the combination of user_id, order_id, and payment_method_id might define uniqueness. For user profiles, you might merge records with the same email but different device IDs. The dedup key depends on what "same entity" means in your domain.
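One way to express those business rules as code: a sketch under the assumption that events arrive as plain dicts with the field names mentioned above; the helper names and sample records are hypothetical.

```python
import hashlib


def payment_dedup_key(event):
    """Payments: user_id + order_id + payment_method_id defines 'the same payment'."""
    raw = f"{event['user_id']}|{event['order_id']}|{event['payment_method_id']}"
    return hashlib.sha256(raw.encode()).hexdigest()


def profile_dedup_key(profile):
    """User profiles: records sharing an email are the same person, whatever the device."""
    return profile["email"].strip().lower()


# Two retries of the same payment from different devices collapse to one key.
a = {"user_id": 7, "order_id": "o-42", "payment_method_id": "pm-1", "device_id": "ios"}
b = {"user_id": 7, "order_id": "o-42", "payment_method_id": "pm-1", "device_id": "web"}
assert payment_dedup_key(a) == payment_dedup_key(b)
```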
Companies like Netflix and LinkedIn use slowly changing dimension patterns in their warehouses. Each entity has a canonical version with effective timestamps. Even if raw logs remain noisy, consumers always query the deduplicated snapshot.
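A rough sketch of that slowly-changing-dimension idea, assuming a simple list of versioned rows with effective timestamps; the row layout is illustrative, not the actual schema used at Netflix or LinkedIn.

```python
from datetime import datetime

# Every version of an entity is kept, with an effective time range; the current
# version is the one whose range is still open (effective_to is None).
versions = [
    {"user_id": 7, "email": "old@example.com",
     "effective_from": datetime(2023, 1, 1), "effective_to": datetime(2024, 3, 1)},
    {"user_id": 7, "email": "new@example.com",
     "effective_from": datetime(2024, 3, 1), "effective_to": None},
]


def current_snapshot(rows):
    """What consumers query: exactly one canonical row per entity."""
    return {r["user_id"]: r for r in rows if r["effective_to"] is None}


print(current_snapshot(versions)[7]["email"])  # new@example.com
```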
💡 Key Takeaways
✓ Three-layer defense: edge idempotency checks, streaming dedup with 1 to 24 hour windows at p99 under 200 ms, and nightly batch jobs over full history.
✓ At 500,000 events per second, a 1 percent spike creates 400 million duplicate records in a day. Streaming catches about 95 percent; batch corrects the rest.
✓ Business keys define uniqueness: order_id + user_id for transactions, email for user profiles. Technical dedup alone is insufficient.
✓ Tie-breaking rules matter: when duplicates exist, keep the latest updated_at, the row with a valid payment status, or the highest data quality score.
📌 Examples
1. A streaming processor stores the last 12 hours of event IDs in RocksDB state. Late events beyond 12 hours pass through and are caught by the nightly batch dedup.
2. A batch job groups 2 billion order events by order_uuid. For each group, it keeps the row with the latest updated_at timestamp and marks the others as superseded.
3. Netflix uses slowly changing dimensions with effective dates. Consumers query the current snapshot, which is always deduplicated, even though the raw Kafka logs contain duplicates.
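A sketch of the batch tie-breaking from example 2, kept in plain Python over a tiny in-memory list so it runs as-is; a real nightly job would do the same group-by in Spark or SQL, and the sample rows are invented.

```python
from itertools import groupby
from operator import itemgetter

orders = [
    {"order_uuid": "o-42", "updated_at": "2024-06-01T10:00:00Z", "status": "pending"},
    {"order_uuid": "o-42", "updated_at": "2024-06-01T10:05:00Z", "status": "paid"},
    {"order_uuid": "o-99", "updated_at": "2024-06-01T09:00:00Z", "status": "paid"},
]

# Group by the business key, keep the row with the latest updated_at,
# and mark everything else in the group as superseded.
canonical, superseded = [], []
for _, group in groupby(sorted(orders, key=itemgetter("order_uuid")),
                        key=itemgetter("order_uuid")):
    rows = sorted(group, key=itemgetter("updated_at"), reverse=True)
    canonical.append(rows[0])
    superseded.extend(rows[1:])

print(len(canonical), "canonical rows,", len(superseded), "superseded")  # 2 canonical, 1 superseded
```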