
What is Data Deduplication?

Definition
Data Deduplication is the systematic process of identifying and resolving logically identical records that appear multiple times in your data platform, ensuring each entity exists exactly once in the canonical version of your data.
The Core Problem: Duplicates corrupt business metrics. If an order appears three times in your events table, your revenue dashboard shows 3x the actual amount. If a user signup event is recorded twice, your growth metrics are inflated. For a platform processing 40 billion events per day, even a 0.1 percent duplicate rate introduces 40 million corrupt records daily.

Where Duplicates Come From: Clients retry HTTP requests on timeout. Message buses deliver events multiple times due to network issues. Backfill jobs reprocess historical data. Mobile apps queue events offline and replay them later. CDC (Change Data Capture) systems emit the same row update multiple times during failures.

Two Key Dimensions: First, exact versus fuzzy matching. Exact dedup relies on a stable identifier such as order_id or event_uuid; any repeated key signals a duplicate. Fuzzy dedup compares combinations of fields such as name, email, timestamp, and location using similarity functions; it handles cases where no reliable key exists, but the matching is probabilistic. Second, online versus offline dedup. Online dedup runs in streaming systems with latency targets under 100 milliseconds, filtering events as they arrive. Offline dedup runs in batch jobs with minute-to-hour latency, scanning the full dataset to catch everything the online layer misses.
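The online, exact variant usually sits as a keyed filter in front of the stream. Below is a minimal Python sketch, assuming each event carries a stable event_uuid; the in-memory cache and one-hour TTL are illustrative only, since a production deployment would typically keep this state in the stream processor's keyed state or an external store.

```python
import time


class StreamingDeduper:
    """Drops events whose event_uuid was already seen within a TTL window.

    Exact, online dedup: relies on a stable key and bounded memory, so it
    only catches duplicates that arrive within the TTL window.
    """

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl_seconds = ttl_seconds
        self._seen: dict[str, float] = {}  # event_uuid -> first-seen time

    def is_duplicate(self, event: dict) -> bool:
        now = time.time()
        # Evict expired keys so memory stays bounded (a full rebuild is
        # fine for a sketch; real systems evict lazily or use keyed state).
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.ttl_seconds}
        key = event["event_uuid"]
        if key in self._seen:
            return True
        self._seen[key] = now
        return False


deduper = StreamingDeduper(ttl_seconds=3600)
events = [
    {"event_uuid": "a1", "amount": 20},
    {"event_uuid": "a1", "amount": 20},  # client retry: exact duplicate
    {"event_uuid": "b2", "amount": 35},
]
unique = [e for e in events if not deduper.is_duplicate(e)]
print(len(unique))  # 2
```

The TTL is the key trade-off: a longer window catches more duplicates but holds more state, which is exactly why the offline layer still exists.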
❗ Remember: Deduplication is not a one-time cleanup task. It is a continuous, layered strategy that prevents duplicates at ingestion, filters them in streaming, and corrects them in batch processing.
💡 Key Takeaways
Duplicates corrupt business metrics by overcounting revenue, users, and engagement. A 0.1 percent duplicate rate at 40 billion events per day creates 40 million corrupt records.
Exact dedup uses stable identifiers like order_id or event_uuid. Fuzzy dedup uses similarity on multiple fields and is probabilistic.
Online dedup runs in streaming with sub-100-millisecond latency. Offline dedup runs in batch with full historical context but minute-to-hour latency (sketched after these takeaways).
Duplicates enter from client retries, at-least-once message delivery, backfill jobs, and CDC system replays during failures.
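The offline layer, by contrast, can scan the full dataset and keep exactly one canonical row per key, typically the latest version. A minimal batch sketch, assuming illustrative order_id and updated_at fields; in a warehouse the same logic is commonly expressed with ROW_NUMBER() partitioned by the key.

```python
from typing import Iterable


def dedup_keep_latest(rows: Iterable[dict], key: str = "order_id",
                      version_field: str = "updated_at") -> list[dict]:
    """Batch dedup: keep one row per key, the one with the highest
    version_field value. Catches duplicates that slipped past the
    streaming layer, no matter how far apart they arrived."""
    latest: dict[str, dict] = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version_field] > latest[k][version_field]:
            latest[k] = row
    return list(latest.values())


rows = [
    {"order_id": "o-1", "updated_at": "2024-05-01T10:00:00", "status": "created"},
    {"order_id": "o-1", "updated_at": "2024-05-01T10:05:00", "status": "paid"},  # CDC replay
    {"order_id": "o-2", "updated_at": "2024-05-01T11:00:00", "status": "created"},
]
print(dedup_keep_latest(rows))  # two rows; o-1 resolves to its latest version
```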
📌 Examples
1. A mobile app retries a purchase request after a timeout. Without dedup, the user is charged twice and the order appears twice in analytics.
2. A Kafka consumer crashes mid-processing and restarts from the last committed offset, reprocessing the last 1,000 events and creating duplicates.
3. A backfill job replays the last 7 days of events to fix a bug. Without proper dedup keys, 600 million events are duplicated in the warehouse.
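All three examples rely on a stable key. When no such key exists, the fuzzy approach described above compares field similarity instead. A minimal sketch using Python's standard-library difflib; the field choices and 0.85 threshold are illustrative assumptions rather than a recommended configuration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_user(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Fuzzy dedup: with no stable key, compare a combination of fields.
    The result is probabilistic, not exact."""
    name_score = similarity(rec_a["name"], rec_b["name"])
    email_score = similarity(rec_a["email"], rec_b["email"])
    return (name_score + email_score) / 2 >= threshold


a = {"name": "Jane Doe", "email": "jane.doe@example.com"}
b = {"name": "Jane  Doe", "email": "janedoe@example.com"}
print(likely_same_user(a, b))  # True at the default threshold
```

The threshold is the trade-off knob: raise it and you miss duplicates, lower it and you risk merging records that belong to genuinely different entities.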