Message Queues & Streaming › Dead Letter Queues & Error Handling · Hard · ⏱️ ~3 min

DLQ Redrive: Safe Reprocessing and Rate Limiting

The Risk: Redriving DLQ messages is inherently risky; those messages failed for a reason, and bulk reprocessing can overwhelm fragile dependencies. Amazon teams commonly cap redrive throughput at 1-5% of normal traffic volume using token bucket algorithms.

Safe Redrive Workflow:
📋 Graduated Redrive Steps
1. Deploy the fix (schema update, dependency config, code patch)
2. Process canary batch of 100-1,000 messages
3. Require >95% success rate before full redrive
4. Create dedicated rate-limited subscription for redrive
5. If success drops below threshold, pause automatically and alert
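The steps above can be sketched in Python. This is a minimal illustration, not a production implementation: `process(msg)` is a hypothetical callback that returns True on success, the token bucket caps throughput at 5% of normal traffic, and the canary gate aborts if the first batch falls below the success threshold.

```python
import time

class TokenBucket:
    """Caps redrive throughput to a fraction of normal traffic."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def redrive(messages, process, normal_tps=10_000, canary_size=1_000, min_success=0.95):
    """Graduated redrive: rate-limited, gated on canary-batch success."""
    bucket = TokenBucket(rate_per_sec=normal_tps * 0.05, burst=50)  # 5% cap
    successes = attempts = 0
    for msg in messages:
        while not bucket.acquire():
            time.sleep(0.01)
        attempts += 1
        successes += 1 if process(msg) else 0
        # Canary gate: pause the whole redrive if the first batch underperforms.
        if attempts == canary_size and successes / attempts < min_success:
            raise RuntimeError("canary batch below success threshold; pausing redrive")
    return successes, attempts
```

In a real system the `RuntimeError` would instead pause the redrive subscription and page the on-call engineer, per step 5 above.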
Idempotency Enforcement: At-least-once delivery guarantees duplicates. Messages may also have partially succeeded before failing (e.g., wrote to the database but failed calling an external API). Use idempotency keys at business-operation granularity: order IDs or payment IDs stored in a dedup store with bounded retention (7-30 days). The consumer checks this store before processing and skips the message if the key already exists.

Adaptive Rate Limiting: If dependency error rates climb during redrive, automatically throttle or pause. Amazon teams implement circuit breakers that monitor error rate over sliding windows; for example, 5 errors in 10 requests triggers the open state for 60 seconds. Microsoft Azure customers layer multiple safeguards: per-message-type quotas, dependency-specific rate limits, and global throughput caps whose manual override requires manager approval.
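The idempotency check might look like the following sketch, using an in-memory dict as a stand-in for a bounded-retention store such as Redis with a TTL. The `order_id` field and the `handler` callback are illustrative assumptions.

```python
import time

class DedupStore:
    """In-memory stand-in for a bounded-retention dedup store
    (in production: e.g. Redis SETNX with a TTL)."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # idempotency key -> expiry timestamp

    def mark_if_new(self, key):
        now = time.monotonic()
        if key in self.seen and self.seen[key] > now:
            return False  # duplicate within the retention window
        # New key, or an expired entry we simply overwrite.
        self.seen[key] = now + self.ttl
        return True

def consume(message, store, handler):
    """Skip messages whose business-level key was already processed."""
    key = message["order_id"]  # idempotency key at business-operation granularity
    if not store.mark_if_new(key):
        return "skipped"
    handler(message)
    return "processed"
```

A 14-day TTL (within the 7-30 day range above) keeps the store bounded while still covering realistic redrive windows.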
💡 Key Takeaways
Cap redrive throughput to 1-5% of normal traffic volume using token bucket or leaky bucket rate limiting to avoid overwhelming dependencies
Require canary batch verification, with 100-1,000 messages achieving a greater than 95% success rate, before enabling full redrive
Google Pub/Sub uses dedicated rate-limited subscriptions for redrive, isolating reprocessing from normal flow and preserving multi-tenant consumer isolation
Idempotency keys at the business-operation level (order ID, payment ID) with bounded 7-30 day retention prevent duplicate side effects from at-least-once delivery
Adaptive rate limiting monitors downstream error rates during redrive, automatically throttling or pausing when errors exceed thresholds (circuit breaker pattern)
Amazon Prime Day preparation includes raising circuit breaker thresholds and widening backoff caps from 30 seconds to 2-5 minutes to handle 10-50x baseline load
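The circuit-breaker behavior described above (5 errors within a sliding window of 10 requests opens the breaker for 60 seconds) can be sketched as:

```python
import time
from collections import deque

class CircuitBreaker:
    """Opens when errors in a sliding window of recent requests exceed a threshold."""
    def __init__(self, window=10, max_errors=5, open_seconds=60):
        self.results = deque(maxlen=window)  # True = error, oldest entries fall off
        self.max_errors = max_errors
        self.open_seconds = open_seconds
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.open_seconds:
            # Half-open: let traffic probe the dependency again.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, error):
        self.results.append(error)
        if sum(self.results) >= self.max_errors:
            self.opened_at = time.monotonic()
```

During a redrive, the consumer calls `allow()` before each message and `record()` after each downstream call; when the breaker opens, the redrive pauses rather than hammering a struggling dependency.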
📌 Interview Tips
1. After fixing a schema validation bug, a team redrives 50,000 DLQ messages at 500 messages per second (5% of the normal 10,000/sec throughput), with a circuit breaker monitoring the downstream database error rate
2. E-commerce order processing uses order identifiers as idempotency keys stored in Redis with a 14-day TTL; a redrive after a payment gateway fix skips the 20% of messages that were already partially processed
3. A Microsoft Azure customer implements three-layer rate limiting: 100 messages per second per message type, 500 total across types, and a 1,000 global cap requiring VP approval to override during an incident
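The three-layer scheme in the last example could be sketched as follows. The one-second counting window and the `override` flag (standing in for manual VP approval) are illustrative assumptions.

```python
import time
from collections import defaultdict

class LayeredLimiter:
    """Checks three per-second quotas in order: per message type,
    total across types, and a hard global cap liftable only by manual override."""
    def __init__(self, per_type=100, across_types=500, global_cap=1000):
        self.per_type = per_type
        self.across_types = across_types
        self.global_cap = global_cap
        self.override = False  # flipped only with manual approval
        self._window = None
        self._by_type = defaultdict(int)
        self._total = 0

    def allow(self, msg_type, now=None):
        window = int(now if now is not None else time.monotonic())
        if window != self._window:  # new one-second window: reset counters
            self._window = window
            self._by_type.clear()
            self._total = 0
        if self._total >= self.global_cap and not self.override:
            return False  # hard cap; requires override to exceed
        if self._total >= self.across_types:
            return False  # cross-type quota exhausted
        if self._by_type[msg_type] >= self.per_type:
            return False  # this message type's quota exhausted
        self._by_type[msg_type] += 1
        self._total += 1
        return True
```

Checking the narrowest quota last means a noisy message type is throttled without blocking other types, while the global cap still bounds total blast radius.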