
Dead Letter Queues and Poison Message Handling

A dead letter queue (DLQ) is a separate queue where messages that cannot be processed successfully are automatically moved after a configured number of delivery attempts. Without a DLQ, poison messages (messages that fail permanently due to malformed payloads, schema mismatches, or application bugs) retry indefinitely, wasting consumer capacity and potentially masking healthy traffic. Amazon SQS lets you configure a maximum receive count (commonly 3 to 5 attempts) before a message is automatically redriven to a DLQ; Azure Service Bus supports a similar max delivery count policy per queue or subscription.

The DLQ becomes your operational triage point. Monitor DLQ depth and oldest message age as critical alerts: a rapidly growing DLQ indicates either a deployment bug affecting all messages or a category of input causing systematic failures. Set up dashboards showing the DLQ inflow rate, the top error types from consumer logs correlated with DLQ messages, and the age distribution. Messages sitting in a DLQ for days represent lost business value and require manual investigation and potential reprocessing after fixes.

Handling poison messages requires a multi-layered defense. First, validate and sanitize inputs before enqueuing to catch malformed data early. Second, implement defensive parsing in consumers with structured error logging that captures the message ID and a payload snippet for debugging. Third, use separate DLQs per message type or tenant to isolate the blast radius. Fourth, build redrive tooling that can safely replay DLQ messages back to the main queue with rate limiting and sampling (replay 10% first to verify the fix before replaying the rest).

Common failure patterns include schema evolution issues (a new consumer version can't parse old message formats), missing dependent data (a message references an order that was deleted), oversized payloads (a message exceeds size limits after base64 encoding), and transient dependency failures that look permanent (a downstream API returns 500 for hours). The last case is why choosing an appropriate retry count matters: too few retries (max 2) turn transient issues into DLQ floods; too many (max 20) delay the identification of real poison messages and waste capacity on hopeless retries.
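To make the SQS side concrete, the sketch below wires a source queue to a DLQ with boto3. The queue names, region, and the choice of maxReceiveCount = 5 are illustrative assumptions, not values prescribed here.

```python
# Minimal sketch: create a DLQ, then attach a redrive policy to the source queue.
# Queue names and region are hypothetical.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Create (or look up) the dead letter queue and fetch its ARN.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 5 receives with no delete (the visibility timeout keeps expiring without
# an acknowledgment), SQS moves the message to the DLQ automatically.
source_url = sqs.create_queue(QueueName="orders")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=source_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```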
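On the consumer side, defensive parsing with structured error logging might look like the following sketch. The handle_order function, queue URL, and log fields are hypothetical; the key points are that parse failures are logged with the message ID, delivery count, and a payload snippet, and that the message is deliberately not deleted so the redrive policy can move it to the DLQ after repeated failures.

```python
# Sketch of a defensive consumer loop; handle_order() and the queue URL are placeholders.
import json
import logging
import boto3

log = logging.getLogger("orders-consumer")
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def handle_order(payload: dict) -> None:
    ...  # hypothetical business logic

def consume_once() -> None:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
        AttributeNames=["ApproximateReceiveCount"],
    )
    for msg in resp.get("Messages", []):
        receive_count = int(msg["Attributes"]["ApproximateReceiveCount"])
        try:
            payload = json.loads(msg["Body"])  # defensive parse of the payload
            handle_order(payload)
        except Exception:
            # Structured error log: message ID, delivery count, payload snippet,
            # so DLQ entries can later be correlated with the failing step.
            log.exception(
                "processing failed id=%s receive_count=%d body_snippet=%r",
                msg["MessageId"], receive_count, msg["Body"][:200],
            )
            continue  # no delete: let the redrive policy handle repeated failures
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```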
💡 Key Takeaways
Configure maximum receive count between 3 and 5 attempts: fewer than 3 sends transient failures to DLQ prematurely; more than 5 wastes processing capacity on poison messages that will never succeed
DLQ metrics are operational early warning: DLQ inflow rate spiking from baseline 0.1% to 10% of enqueue rate indicates a new deployment or upstream change is systematically breaking message processing
Separate DLQs by message type or tenant: if account_created and order_placed messages share a DLQ, a schema bug in order processing floods the DLQ and obscures account creation issues; isolation improves root cause analysis
Build controlled redrive tooling: after deploying a fix, replay 10% of DLQ messages as a canary; if the success rate is high, gradually increase to 50% and then 100%; never bulk replay without rate limiting, or you risk overwhelming dependencies (a sketch follows this list)
Track the per-message delivery count in logs: correlate messages with a receiveCount greater than 3 against consumer error logs containing that message ID; this identifies which validation or parsing step is failing and supports forensic analysis
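A controlled redrive pass of the kind described above could start from something like this sketch. The sample rate, rate limit, and queue URLs are assumptions; a production tool would also preserve message attributes and record which messages were replayed.

```python
# Sketch of canary-style DLQ redrive: replay a sampled fraction of DLQ messages
# to the source queue with a crude rate limit. Queue URLs are placeholders.
import random
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

def redrive(dlq_url: str, source_url: str,
            sample_rate: float = 0.10, max_per_second: float = 5.0) -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=dlq_url, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # DLQ drained (or remaining messages are in flight)
        for msg in messages:
            if random.random() > sample_rate:
                continue  # unsampled messages become visible again and stay in the DLQ
            sqs.send_message(QueueUrl=source_url, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
            time.sleep(1.0 / max_per_second)  # rate limit to protect downstream dependencies
```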
📌 Examples
Amazon SQS redrive policy: configure the source queue with a RedrivePolicy specifying maxReceiveCount of 5 and deadLetterTargetArn pointing to the DLQ; after 5 unsuccessful deliveries (the visibility timeout expires without the message being deleted), SQS automatically moves it to the DLQ
Netflix message handling: separate DLQs per service and message type; automated alerts fire when DLQ depth exceeds a threshold or the oldest message age exceeds 1 hour; the on-call engineer uses internal tooling to sample DLQ messages, identify the root cause, deploy a fix, and replay with exponential backoff
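The alerting side of this pattern can be approximated with CloudWatch alarms on the standard AWS/SQS metrics for the DLQ, as in the rough sketch below. The queue name, thresholds, and SNS topic ARN are placeholder assumptions.

```python
# Sketch: alarm on DLQ depth and on the age of the oldest DLQ message.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:dlq-alerts"  # placeholder

# Oldest DLQ message older than 1 hour (3600 seconds).
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-oldest-message-age",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=3600,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)

# DLQ depth crossing a threshold, signalling a systematic failure.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)
```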