Deduplication at Scale: Memory and Performance Trade-offs
The Memory Problem:
Storing every event ID you have ever seen is impossible. At 500,000 events per second, you generate roughly 43 billion events per day. If each event ID is 36 bytes (a UUID string) plus 64 bytes of overhead, that is 100 bytes per key. Keeping one day of state therefore requires about 4.3 terabytes. This does not fit in RAM.
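A quick back-of-envelope check makes the arithmetic concrete. This is plain Python; the event rate and per-key overhead are the figures assumed above, not measurements.

```python
# Back-of-envelope sizing for exact dedup state, using the figures from the text.
EVENTS_PER_SEC = 500_000
BYTES_PER_KEY = 36 + 64  # 36-byte UUID string + ~64 bytes of map/entry overhead

def state_bytes(window_seconds: int) -> int:
    """Exact dedup state size for one window at the assumed event rate."""
    return EVENTS_PER_SEC * window_seconds * BYTES_PER_KEY

print(f"{state_bytes(24 * 3600) / 1e12:.2f} TB for one day")  # ~4.32 TB
```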
Window-Based State:
The most common solution is bounded windows: keep only the last N hours of event IDs in state. A duplicate that arrives after its original has aged out of the window slips through and is treated as unique. This trades completeness for bounded memory.
For example, a 24 hour window at 100,000 events per second stores 8.6 billion keys. At 100 bytes each, that is 860 gigabytes. Distributed across 10 partitions, each holds 86 gigabytes, which is manageable per node with a RocksDB-backed state store. Late arrivals beyond 24 hours bypass dedup, but they represent less than 1 percent of traffic in most systems.
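As a minimal sketch of the idea, here is a toy WindowedDeduper class: an in-process Python map with timestamp eviction. A real streaming job would use keyed state with a TTL (for example Flink state TTL over RocksDB) rather than a single dict, so treat this as an illustration of the window logic only.

```python
import time
from collections import OrderedDict
from typing import Optional

class WindowedDeduper:
    """Remembers event IDs seen within the last `window_seconds`; older IDs are
    evicted, so a duplicate arriving after its original ages out is accepted as new."""

    def __init__(self, window_seconds: float = 24 * 3600):
        self.window = window_seconds
        self.seen: "OrderedDict[str, float]" = OrderedDict()  # event_id -> first-seen time

    def _evict(self, now: float) -> None:
        # Entries are kept in first-seen order, so pop from the front
        # until the oldest remaining entry is inside the window.
        while self.seen:
            oldest_id, first_seen = next(iter(self.seen.items()))
            if now - first_seen <= self.window:
                break
            self.seen.popitem(last=False)

    def is_duplicate(self, event_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self._evict(now)
        if event_id in self.seen:
            return True                # seen inside the window: drop as duplicate
        self.seen[event_id] = now      # first sighting: remember and accept
        return False
```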
Approximate Structures:
Bloom filters reduce memory dramatically. A Bloom filter for 1 billion keys with 1 percent false positive rate requires only 1.2 gigabytes instead of 100 gigabytes. However, false positives mean some unique events are incorrectly flagged as duplicates. False negatives are impossible, so you never accept a duplicate, but you may reject valid events.
In practice, many systems use Bloom filters for hot path dedup in streaming, then rely on batch jobs to detect and correct any false positives. The batch layer has full context and can reprocess with exact logic.
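A minimal Bloom-filter dedup sketch follows, as a toy BloomDeduper class built from a pure-Python bit array and SHA-256 double hashing. Production pipelines would normally use an existing Bloom filter implementation; this sketch only illustrates the sizing and the "no false negatives, some false positives" behavior described above.

```python
import hashlib
import math

class BloomDeduper:
    """Approximate dedup: a true duplicate is always caught (no false negatives),
    but roughly `fp_rate` of genuinely new events are wrongly flagged as duplicates."""

    def __init__(self, expected_keys: int, fp_rate: float = 0.01):
        # Standard sizing: m bits and k hash functions for n keys at false-positive rate p.
        self.m = math.ceil(-expected_keys * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_keys) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two 64-bit halves of a SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def is_duplicate(self, event_id: str) -> bool:
        pos = self._positions(event_id)
        seen = all(self.bits[p // 8] >> (p % 8) & 1 for p in pos)
        if not seen:
            for p in pos:                       # first sighting: set the bits
                self.bits[p // 8] |= 1 << (p % 8)
        return seen

# Sizing matches the figure in the text:
# BloomDeduper(1_000_000_000, 0.01).m / 8 / 1e9  ->  about 1.2 (GB of bits)
```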
Batch Dedup at Scale:
Batch dedup scans billions of rows and groups by dedup keys. The grouping operation is a distributed shuffle, which can bottleneck on network and memory. A single heavy-hitter key (like a bot user_id appearing millions of times) can overload one reducer.
The solution is two-stage aggregation or sampling. First, group by a coarse key such as date and region to spread load. Then, within each group, apply secondary partitioning on a hash of user_id. This prevents any single reducer from handling more than a few million records.
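For illustration, here is one common way to realize the two-stage idea in PySpark, using a random salt so a heavy-hitter dedup key is first thinned across many reducers before a final exact pass. The input path, the order_id dedup key, and the salt count are illustrative assumptions, not prescriptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-dedup").getOrCreate()
events = spark.read.parquet("s3://my-bucket/events/dt=2024-01-01/")  # illustrative path

NUM_SALTS = 64  # how many ways to split each dedup key in the first stage

# Stage 1: attach a random salt and dedup within (order_id, salt).
# A key with 10 million duplicate rows is reduced to at most NUM_SALTS rows,
# and its work is spread across NUM_SALTS reducers instead of one.
stage1 = (
    events
    .withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    .dropDuplicates(["order_id", "salt"])
)

# Stage 2: the surviving rows are small enough for an exact dedup on the real key.
deduped = stage1.drop("salt").dropDuplicates(["order_id"])
```

Note that dropDuplicates keeps an arbitrary surviving row per key; if a specific copy must win (for example the earliest by event time), swap both stages for an aggregation that explicitly picks it.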
Window Size Impact (at 100,000 events per second, 100 bytes per key):
1 hour → 36 GB
24 hours → 860 GB
7 days → 6 TB
⚠️ Common Pitfall: Setting an unbounded dedup window or using exact structures for high-cardinality keys. This causes out-of-memory errors during traffic spikes or backfills.
"At 100,000 events per second, a 24 hour window requires 860 GB of state. Going to 7 days means 6 TB, which does not fit in memory. Choose window size based on your late arrival distribution, not wishful thinking."
💡 Key Takeaways
✓ Exact dedup for 43 billion events per day requires about 4.3 TB of memory in total. Bounded windows trade completeness for bounded state.
✓ A 24 hour window at 100,000 events per second stores 8.6 billion keys in 860 GB of memory, split across partitions for manageability.
✓ Bloom filters reduce memory from 100 GB to 1.2 GB for 1 billion keys at a 1 percent false positive rate, but may incorrectly reject valid events.
✓ Heavy-hitter keys in a batch shuffle can overload single reducers. Use two-stage aggregation or secondary partitioning to spread load evenly.
📌 Examples
1. A Flink job with 10 partitions keeps 24 hour state in RocksDB. Each partition handles 10,000 events per second and stores 86 GB of keys, staying within memory limits.
2. A Spark batch job groups 5 billion events by order_id. One bot account has 10 million events, overloading a reducer. Adding a secondary hash partition on user_id spreads the load.
3. Using a Bloom filter with 0.5 percent false positive rate, a streaming system incorrectly flags 2 million valid events per day as duplicates, which the nightly batch job corrects.