DLQ Failure Modes: Poison Messages and Ordering Violations
Poison Message Loops:
The most insidious DLQ failure mode: messages that always fail (due to schema incompatibility or code bugs) fill the DLQ, and redriving them without fixing the root cause simply cycles them back. Amazon engineers maintain blocklists of message fingerprints (hashes of the payload structure plus the error signature) that prevent automatic redrive; clearing an entry requires an explicit manual override with documented verification of the fix.
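A minimal sketch of the fingerprint idea, assuming the payload is JSON-like: the hash covers the payload's *structure* (key paths and value types, not values) plus the error signature, so every instance of the same failure maps to one blocklist entry. `RedriveGate` and its method names are hypothetical, not an AWS API.

```python
import hashlib
import json

def fingerprint(payload: dict, error: str) -> str:
    """Hash the payload's structure (sorted key paths + value types, never
    the values) together with the error signature."""
    def key_paths(obj, prefix=""):
        paths = []
        if isinstance(obj, dict):
            for k in sorted(obj):
                paths.extend(key_paths(obj[k], f"{prefix}.{k}"))
        else:
            paths.append(f"{prefix}:{type(obj).__name__}")
        return paths
    basis = json.dumps({"schema": key_paths(payload), "error": error})
    return hashlib.sha256(basis.encode()).hexdigest()[:16]

class RedriveGate:
    """Blocks automatic redrive of messages matching known-bad fingerprints."""
    def __init__(self):
        self.blocklist = set()

    def block(self, fp: str) -> None:
        self.blocklist.add(fp)

    def allow_redrive(self, payload: dict, error: str,
                      manual_override: bool = False) -> bool:
        # Manual override models the documented, human-verified fix path.
        return manual_override or fingerprint(payload, error) not in self.blocklist
```

Because the fingerprint ignores field values, blocking one poison message blocks the whole class of messages that fail the same way, while structurally different messages still redrive normally.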
Retry Storm Collapse:
If a DLQ accumulates 100,000 messages during a database outage and you redrive at 10,000 msg/sec, you can exhaust the database connection pool (typically 100-500 connections) and trigger cascading timeouts. The fix is graduated redrive: start at 100 msg/sec, monitor p99 latency and error rate for 5 minutes, and double the throughput only while both stay healthy.
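The graduated-redrive loop can be sketched as follows. `redrive_batch(rate, secs)` (drains the DLQ at the given rate for one window, returning True when empty) and `healthy()` (checks p99 latency and error rate) are hypothetical hooks onto your queue client and metrics system, not a real SDK:

```python
def graduated_redrive(redrive_batch, healthy,
                      start_rate=100, max_rate=10_000, window_secs=300):
    """Ramp redrive throughput: hold each rate for one monitoring window,
    double it while dependencies look healthy, halve it when they don't."""
    rate = start_rate
    while True:
        drained = redrive_batch(rate, window_secs)  # True when DLQ is empty
        if drained:
            return
        if healthy():
            rate = min(rate * 2, max_rate)          # 100 -> 200 -> 400 -> ...
        else:
            rate = max(start_rate, rate // 2)       # back off instead of storming
```

Capping at `max_rate` and backing off on unhealthy signals keeps the redrive from ever exceeding what the downstream connection pool survived during the ramp.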
PII and Retention Risk:
DLQs often retain messages longer than main queues (14-30 days vs. 1-7 days), and failed messages may contain sensitive data that should already have been deleted. Microsoft Azure customers address this with scrubbing pipelines that tokenize or redact PII fields after the initial triage window, encryption at rest with key rotation, and explicit purge jobs that prevent unbounded storage growth.
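A minimal sketch of the tokenization step, assuming a flat message and a known set of PII field names (`PII_FIELDS` and the salt handling here are illustrative assumptions, not an Azure feature): each PII value is replaced with a deterministic token, so operators can still correlate related failures during triage without retaining the raw value.

```python
import hashlib

# Assumed PII field names; in practice these come from your schema registry.
PII_FIELDS = {"email", "ssn", "phone"}

def tokenize(value: str, salt: str = "rotate-me") -> str:
    """Deterministically replace a PII value with an opaque token."""
    return "tok_" + hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def scrub(message: dict) -> dict:
    """Return a copy of a DLQ message with PII fields tokenized; run this
    when a message ages past the initial triage window."""
    return {k: tokenize(v) if k in PII_FIELDS and isinstance(v, str) else v
            for k, v in message.items()}
```

Because the same input always yields the same token (until the salt rotates), two failed messages from the same customer remain linkable after scrubbing, which is the property that makes tokenization preferable to plain redaction.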
⚠️ FIFO Ordering Danger
In ordered (FIFO) queues, moving a message to the DLQ unblocks later messages for the same partition key. Example: a bank ledger processes deposit→withdrawal for account 12345; if the deposit fails and moves to the DLQ, the withdrawal processes first and creates a negative balance. Mitigation: key-level pausing, holding all messages for that key until the DLQ message is manually resolved.
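Key-level pausing can be sketched as a thin consumer wrapper (the class and its return values are illustrative, not a broker feature): when a message for a key fails to the DLQ, later messages for that key are buffered in arrival order, while other keys keep flowing.

```python
from collections import deque

class KeyPausingConsumer:
    """FIFO-safe consumer sketch: a key whose message went to the DLQ is
    paused, and later messages for it are buffered until resume()."""
    def __init__(self, handler):
        self.handler = handler          # processes (key, msg); raises on failure
        self.paused = {}                # key -> deque of buffered messages

    def consume(self, key, msg):
        if key in self.paused:
            self.paused[key].append(msg)    # preserve order behind the failure
            return "buffered"
        try:
            self.handler(key, msg)
            return "processed"
        except Exception:
            self.paused[key] = deque()      # message goes to DLQ; pause the key
            return "dlq"

    def resume(self, key):
        """Call after the DLQ message for `key` is manually resolved."""
        for msg in self.paused.pop(key, deque()):
            self.handler(key, msg)
```

Note the trade-off: a single stuck key can buffer indefinitely, so production versions bound the buffer and alert when a key stays paused past a deadline.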
💡 Key Takeaways
✓ Poison message loops occur when redriving without fixing the root cause; fingerprint blocklists prevent automatic redrive of known-bad patterns and require manual override
✓ FIFO queue ordering violations happen when the DLQ removes a message, unblocking later messages for the same key; key-level pausing holds the rest of the sequence until resolved
✓ Retry storms during bulk redrive can exhaust dependency connection pools (100-500 connections is typical); graduated redrive doubles throughput every 5 minutes with health monitoring
✓ DLQ retention (14-30 days) often exceeds main-queue retention (1-7 days), creating PII compliance risk; scrubbing pipelines tokenize sensitive fields after the triage period
✓ TTL interactions vary by platform, and some systems never auto-expire DLQ messages, requiring explicit purge jobs to prevent unbounded storage growth and cost
✓ Multi-tenant topics cannot safely redrive to a shared topic, since other consumers would receive the messages; per-consumer DLQs with dedicated retry topics preserve isolation
📌 Interview Tips
1. A bank ledger system implements key-level pausing: when a deposit message for account 12345 fails, all messages for that account queue until the DLQ entry is resolved, preserving transaction ordering
2. After a schema change creates 50,000 DLQ messages, the team uses a fingerprint blocklist to prevent redrive until contract tests verify backward compatibility across all consumer versions
3. An Azure enterprise customer implements three-tier retention: main queue 7 days, DLQ full payload 14 days, DLQ with tokenized PII only 30 days, with automated scrubbing at each transition