ETL/ELT Patterns • Data Deduplication Strategies
When to Dedup: Choosing Between Strategies
The Core Decision:
Deduplication is not free. It adds latency, memory, and complexity. The question is not "should we deduplicate" but "where and how much."
Use Streaming Dedup When:
You need low latency for real time dashboards or experiment results. Users expect metrics to update within seconds. You can tolerate 1 to 5 percent duplication in live views, knowing nightly batch jobs will correct the canonical data. Examples include live engagement dashboards, A/B test result previews, and operational monitoring.
Streaming dedup keeps p99 latency under 200 milliseconds while catching 95 percent of duplicates. The remaining 5 percent are late arrivals or edge cases that batch processing handles overnight.
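A minimal sketch of this pattern, assuming each event carries a unique event_id (a hypothetical field) and that a local in-memory window is acceptable; production pipelines usually hold this state in Flink/Spark keyed state or Redis. Events repeated inside the window are dropped, and anything that slips past eviction is left for the nightly batch job to correct.

    import time

    # Windowed streaming dedup sketch. "event_id" is a hypothetical field name;
    # real systems keep this state in a keyed state backend or Redis, not a dict.
    class StreamingDeduper:
        def __init__(self, window_seconds=6 * 3600):
            self.window_seconds = window_seconds
            self._expires_at = {}  # event_id -> expiry timestamp

        def is_duplicate(self, event_id, now=None):
            now = time.time() if now is None else now
            expiry = self._expires_at.get(event_id)
            if expiry is not None and expiry > now:
                return True                          # seen inside the window: drop
            self._expires_at[event_id] = now + self.window_seconds
            return False                             # first sighting (or expired): keep

        def evict_expired(self, now=None):
            now = time.time() if now is None else now
            self._expires_at = {k: t for k, t in self._expires_at.items() if t > now}

    deduper = StreamingDeduper()
    for event in [{"event_id": "a1"}, {"event_id": "a1"}, {"event_id": "b2"}]:
        if not deduper.is_duplicate(event["event_id"]):
            print("emit", event)  # forward to the real-time sink (dashboard, metrics store)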
Use Batch Dedup When:
You need perfect correctness for financial reporting, compliance audits, or billing. Even 0.1 percent error is unacceptable. Latency of hours is acceptable because reports run daily or weekly. Examples include revenue reconciliation, regulatory filings, and customer invoicing.
Batch dedup scans the full dataset with complete historical context. It can apply complex business rules like multi field matching, hierarchical tie breaking, and cross referencing with external systems.
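As a sketch of that kind of rule-driven pass, assuming each row is a dict with hypothetical order_id, customer_id, source, and updated_at fields: group on a multi field key, then apply a hierarchical tie break. In a warehouse the same logic is typically written as a ROW_NUMBER() window over the partition key.

    from collections import defaultdict

    # Batch pass over the full dataset. Field names (order_id, customer_id,
    # source, updated_at) are hypothetical; updated_at is epoch seconds.
    SOURCE_PRIORITY = {"billing_db": 0, "api": 1, "backfill": 2}  # lower = more trusted

    def dedup_batch(rows):
        groups = defaultdict(list)
        for row in rows:
            # Multi field match key rather than a single event id.
            groups[(row["order_id"], row["customer_id"])].append(row)

        canonical = []
        for dupes in groups.values():
            # Hierarchical tie break: most trusted source first, then newest update.
            dupes.sort(key=lambda r: (SOURCE_PRIORITY.get(r["source"], 99), -r["updated_at"]))
            canonical.append(dupes[0])
        return canonical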
Compare Alternatives:
Another approach is pushing dedup upstream to the application layer. Services generate idempotency keys and reject duplicate API calls before they reach the data platform. This reduces downstream complexity but requires tight coordination with product engineering and does not help with operational errors like backfill jobs or CDC replays.
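A hypothetical sketch of that upstream pattern: the client attaches one idempotency key per logical action (reused on retries), and the service refuses to process the same key twice. The in-memory set stands in for a durable key store with a TTL, and, as noted above, none of this protects against a replayed backfill downstream.

    import uuid

    _processed_keys = set()  # stand-in for a durable key store with a TTL

    def client_build_request(payload):
        # One key per logical action; a retry reuses the same key instead of minting a new one.
        return {"idempotency_key": str(uuid.uuid4()), **payload}

    def service_handle(request):
        key = request["idempotency_key"]
        if key in _processed_keys:
            return "duplicate: skipped"      # no second write reaches the data platform
        _processed_keys.add(key)
        return "processed"

    req = client_build_request({"action": "record_purchase", "amount": 42})
    print(service_handle(req))   # processed
    print(service_handle(req))   # duplicate: skipped (same key on retry)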
Some teams choose never to delete duplicates. Instead, they mark the canonical record and flag others as superseded. Queries include a filter like <code>WHERE is_canonical = true</code>. This preserves audit trails and makes dedup reversible, but doubles storage for hot tables and complicates every query.
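A minimal sketch of that approach, assuming rows are dicts with hypothetical id and updated_at fields: duplicates are flagged rather than removed, and consumers read only the canonical view.

    # Flag duplicates instead of deleting them. Field names are hypothetical;
    # the newest record wins the canonical slot here, but any tie break rule works.
    def mark_canonical(dupes):
        canonical = max(dupes, key=lambda r: r["updated_at"])
        for row in dupes:
            row["is_canonical"] = row is canonical
            row["superseded_by_id"] = None if row is canonical else canonical["id"]
        return dupes

    # Consumers then filter to the canonical view, e.g.
    #   SELECT * FROM orders WHERE is_canonical = true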
Decision Framework:
First, measure your duplicate rate and late arrival distribution; a small profiling sketch follows the comparison below. If 99 percent of events arrive within 1 hour, a 6 hour streaming window catches nearly everything. If late arrivals span days, you need batch correction.
Second, define correctness requirements per use case. Real time operational metrics can tolerate 2 to 5 percent error. Billing and compliance need zero tolerance. Run both streaming and batch for different consumers.
Third, calculate cost. Streaming dedup adds 10 to 30 milliseconds per event. For 500,000 events per second, that is significant CPU. Batch dedup adds storage and compute for nightly jobs. If duplicates are rare (under 0.01 percent), lightweight checks may suffice.
Streaming Only
p99 < 200ms, 95% caught, 5% late arrivals missed
vs
Streaming + Batch
Real time dashboards plus nightly full correction
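A sketch of the first framework step, assuming an event sample with hypothetical event_id, event_time, and ingest_time fields (epoch seconds): the duplicate rate tells you how much dedup matters at all, and the p99 arrival lag tells you how wide the streaming window needs to be.

    # Profile a sample of events to size the dedup strategy. Field names
    # (event_id, event_time, ingest_time) are hypothetical; times in epoch seconds.
    def profile_events(events):
        ids = [e["event_id"] for e in events]
        duplicate_rate = 1 - len(set(ids)) / len(ids)

        lags = sorted(e["ingest_time"] - e["event_time"] for e in events)
        p99_lag_s = lags[int(0.99 * (len(lags) - 1))]

        return {
            "duplicate_rate": duplicate_rate,   # e.g. 0.02 means 2% of rows are repeats
            "p99_arrival_lag_s": p99_lag_s,     # informs how wide the streaming window must be
        }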
💡 Key Takeaways
✓Streaming dedup targets p99 under 200ms and catches 95 percent of duplicates. Use for real time dashboards where 1 to 5 percent error is acceptable.
✓Batch dedup provides perfect correctness with full history at hour scale latency. Required for financial reports, billing, and compliance where zero error tolerance exists.
✓Application layer idempotency reduces downstream complexity but needs tight product engineering coordination and misses operational errors like backfills.
✓Marking canonical records instead of deleting duplicates preserves audit trails, but it doubles storage and complicates queries with <code>is_canonical</code> filters.
📌 Examples
1. A real time experiment dashboard uses streaming dedup with 12 hour windows. Nightly batch jobs recompute final results with perfect dedup for archived reports.
2. A billing system relies solely on batch dedup with full dataset scans. Even a 0.01 percent duplicate rate would charge customers incorrectly and violate SLAs.
3. An API gateway implements idempotency keys per request. This prevents most duplicates at the source, but a backfill job still creates duplicates, requiring warehouse dedup.
4. A company marks duplicates with <code>superseded_by_id</code> pointing to the canonical record. Queries filter on <code>superseded_by_id IS NULL</code>, but the table is 2x larger.