
Dead Letter Queues and Poison Message Handling

A dead letter queue (DLQ) is a separate queue where messages that cannot be processed successfully are automatically moved after a configured number of delivery attempts. Without a DLQ, poison messages (messages that fail permanently due to malformed payloads, schema mismatches, or application bugs) retry indefinitely, wasting consumer capacity and potentially masking healthy traffic. Amazon SQS lets you configure a maximum receive count (commonly 3 to 5 attempts) before a message is automatically redriven to a DLQ; Azure Service Bus supports a similar max delivery count policy per queue or subscription.

The DLQ becomes your operational triage point. Monitor DLQ depth and oldest message age as critical alerts: a rapidly growing DLQ indicates either a deployment bug affecting all messages or a category of input causing systematic failures. Set up dashboards showing the DLQ inflow rate, the top error types from consumer logs correlated with DLQ messages, and the age distribution. Messages sitting in a DLQ for days represent lost business value and require manual investigation and potential reprocessing after fixes.

Handling poison messages requires a multi-layered defense. First, validate and sanitize inputs before enqueuing to catch malformed data early. Second, implement defensive parsing in consumers with structured error logging that captures the message ID and a payload snippet for debugging. Third, use separate DLQs per message type or tenant to isolate the blast radius. Fourth, build redrive tooling that can safely replay DLQ messages back to the main queue with rate limiting and sampling (replay 10% first to verify the fix before replaying the rest).

Common failure patterns include schema evolution issues (a new consumer version can't parse old message formats), missing dependent data (a message references an order that was deleted), oversized payloads (a message exceeds size limits after base64 encoding), and transient dependency failures that look permanent (a downstream API returns 500 for hours). The last case is why choosing an appropriate retry count matters: too few retries (max 2) turn transient issues into DLQ floods; too many (max 20) delay the identification of real poison messages and waste capacity on hopeless retries.
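To make the SQS side concrete, the sketch below wires a source queue to a DLQ with boto3. The queue names, region, and the choice of maxReceiveCount = 5 are illustrative assumptions, not values prescribed here.

```python
# Minimal sketch: create a DLQ, then attach a redrive policy to the source queue.
# Queue names and region are hypothetical.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Create (or look up) the dead letter queue and fetch its ARN.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 5 receives with no delete (the visibility timeout keeps expiring without
# an acknowledgment), SQS moves the message to the DLQ automatically.
source_url = sqs.create_queue(QueueName="orders")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=source_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```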
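On the consumer side, defensive parsing with structured error logging might look like the following sketch. The handle_order function, queue URL, and log fields are hypothetical; the key points are that parse failures are logged with the message ID, delivery count, and a payload snippet, and that the message is deliberately not deleted so the redrive policy can move it to the DLQ after repeated failures.

```python
# Sketch of a defensive consumer loop; handle_order() and the queue URL are placeholders.
import json
import logging
import boto3

log = logging.getLogger("orders-consumer")
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def handle_order(payload: dict) -> None:
    ...  # hypothetical business logic

def consume_once() -> None:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
        AttributeNames=["ApproximateReceiveCount"],
    )
    for msg in resp.get("Messages", []):
        receive_count = int(msg["Attributes"]["ApproximateReceiveCount"])
        try:
            payload = json.loads(msg["Body"])  # defensive parse of the payload
            handle_order(payload)
        except Exception:
            # Structured error log: message ID, delivery count, payload snippet,
            # so DLQ entries can later be correlated with the failing step.
            log.exception(
                "processing failed id=%s receive_count=%d body_snippet=%r",
                msg["MessageId"], receive_count, msg["Body"][:200],
            )
            continue  # no delete: let the redrive policy handle repeated failures
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```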
💡 Key Takeaways
Configure maximum receive count between 3 and 5 attempts: fewer than 3 sends transient failures to DLQ prematurely; more than 5 wastes processing capacity on poison messages that will never succeed
DLQ metrics are operational early warning: DLQ inflow rate spiking from baseline 0.1% to 10% of enqueue rate indicates a new deployment or upstream change is systematically breaking message processing
Separate DLQs by message type or tenant: if account_created and order_placed messages share a DLQ, a schema bug in order processing floods the DLQ and obscures account creation issues; isolation improves root cause analysis
Build controlled redrive tooling: after deploying a fix, replay 10% of DLQ messages as a canary; if the success rate is high, gradually increase to 50% and then 100%; never bulk replay without rate limiting, or you risk overwhelming dependencies (a sketch follows this list)
Track the per-message delivery count in logs: correlate messages with a receiveCount greater than 3 against consumer error logs containing that message ID; this identifies which validation or parsing step is failing and supports forensic analysis
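A controlled redrive pass of the kind described above could start from something like this sketch. The sample rate, rate limit, and queue URLs are assumptions; a production tool would also preserve message attributes and record which messages were replayed.

```python
# Sketch of canary-style DLQ redrive: replay a sampled fraction of DLQ messages
# to the source queue with a crude rate limit. Queue URLs are placeholders.
import random
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

def redrive(dlq_url: str, source_url: str,
            sample_rate: float = 0.10, max_per_second: float = 5.0) -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=dlq_url, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # DLQ drained (or remaining messages are in flight)
        for msg in messages:
            if random.random() > sample_rate:
                continue  # unsampled messages become visible again and stay in the DLQ
            sqs.send_message(QueueUrl=source_url, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
            time.sleep(1.0 / max_per_second)  # rate limit to protect downstream dependencies
```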
📌 Examples
Amazon SQS redrive policy: configure the source queue with a RedrivePolicy specifying maxReceiveCount of 5 and deadLetterTargetArn pointing to the DLQ; after 5 unsuccessful deliveries (the visibility timeout expires without the message being deleted), SQS automatically moves it to the DLQ
Netflix message handling: separate DLQs per service and message type; automated alerts fire when DLQ depth exceeds a threshold or the oldest message age exceeds 1 hour; the on-call engineer uses internal tooling to sample DLQ messages, identify the root cause, deploy a fix, and replay with exponential backoff
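The alerting side of this pattern can be approximated with CloudWatch alarms on the standard AWS/SQS metrics for the DLQ, as in the rough sketch below. The queue name, thresholds, and SNS topic ARN are placeholder assumptions.

```python
# Sketch: alarm on DLQ depth and on the age of the oldest DLQ message.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:dlq-alerts"  # placeholder

# Oldest DLQ message older than 1 hour (3600 seconds).
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-oldest-message-age",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=3600,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)

# DLQ depth crossing a threshold, signalling a systematic failure.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)
```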