DLQ Redrive: Safe Reprocessing and Rate Limiting
Redriving messages from a DLQ back to the main processing path is a high-risk operation that can trigger cascading failures if not carefully controlled. The fundamental challenge is that DLQ messages failed for a reason, and bulk reprocessing can overwhelm downstream dependencies that may still be fragile or rate-limited. Production teams at Amazon commonly cap redrive throughput at 1 to 5 percent of normal traffic volume, using token bucket or leaky bucket algorithms to enforce the limit.
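The cap itself is straightforward to express as a token bucket. The sketch below is a minimal Python illustration, not a specific queue client's API: the throughput numbers and the `reprocess` callback are assumptions, and the bucket refills at 5 percent of an assumed normal message rate.

```python
import time

class TokenBucket:
    """Token bucket limiter: tokens refill at `rate` per second, up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical numbers: cap redrive at 5% of a 10,000 msg/s normal throughput.
NORMAL_THROUGHPUT = 10_000
redrive_limiter = TokenBucket(rate=0.05 * NORMAL_THROUGHPUT, capacity=500)

def redrive_message(message, reprocess):
    # Wait for a token before replaying the message through `reprocess`.
    while not redrive_limiter.try_acquire():
        time.sleep(0.01)
    reprocess(message)
```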
The safe redrive workflow starts with root cause verification using a canary batch of 100 to 1,000 messages. Teams deploy the fix (schema update, dependency configuration, code patch), process the canary, and require a greater than 95 percent success rate before proceeding to full redrive. Google Pub/Sub implementations create dedicated rate-limited subscriptions for redrive, isolating reprocessing from the normal consumer flow and preserving multi-tenant isolation. If the canary success rate drops below the threshold, the redrive pauses automatically and alerts the on-call engineer.
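A minimal sketch of that canary gate, assuming hypothetical `process` and `alert_on_call` callbacks; the batch size and threshold mirror the numbers above:

```python
SUCCESS_THRESHOLD = 0.95   # require >95% canary success before full redrive
CANARY_SIZE = 1_000

def run_canary(messages, process, alert_on_call) -> bool:
    """Process a canary slice of DLQ messages and decide whether full redrive is safe."""
    canary = messages[:CANARY_SIZE]
    successes = 0
    for msg in canary:
        try:
            process(msg)
            successes += 1
        except Exception:
            pass  # failed messages stay in (or return to) the DLQ for inspection
    success_rate = successes / max(len(canary), 1)
    if success_rate < SUCCESS_THRESHOLD:
        alert_on_call(f"Canary success {success_rate:.1%} below threshold; redrive paused")
        return False
    return True

# Full redrive proceeds only when the canary clears the bar, e.g.:
# if run_canary(dlq_batch, process_message, page_oncall):
#     start_full_redrive(dlq_batch[CANARY_SIZE:])   # hypothetical helper
```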
Idempotency enforcement becomes critical during redrive because at-least-once delivery semantics guarantee duplicates. Messages may have partially succeeded before failing (wrote to the database but failed the external API call), and redriving will replay those operations. Production systems use idempotency keys at business operation granularity: order identifiers, payment identifiers, or request identifiers stored in a deduplication store with bounded retention (typically 7 to 30 days). The consumer checks this store before processing and skips the message if the key already exists with a matching version.
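The check-then-mark pattern looks roughly like the sketch below. The in-memory store, the `order_id`/`version` field names, and the 14-day TTL are illustrative assumptions; production systems typically back this with Redis or DynamoDB.

```python
import time

class DedupStore:
    """In-memory stand-in for a bounded-retention deduplication store."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._seen = {}  # idempotency_key -> (version, expiry)

    def already_processed(self, key: str, version: int) -> bool:
        entry = self._seen.get(key)
        if entry is None or entry[1] < time.time():
            return False          # unknown key, or retention window expired
        return entry[0] == version

    def mark_processed(self, key: str, version: int) -> None:
        self._seen[key] = (version, time.time() + self.ttl)

dedup = DedupStore(ttl_seconds=14 * 24 * 3600)  # e.g. 14-day retention

def handle(message: dict, process) -> None:
    # Business-level key: an order identifier plus a version, not the queue message ID.
    key, version = message["order_id"], message["version"]
    if dedup.already_processed(key, version):
        return                    # duplicate delivery: skip side effects
    process(message)
    dedup.mark_processed(key, version)
```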
Rate limiting must be adaptive to downstream health. If dependency error rates climb during redrive, the system should automatically throttle or pause. Amazon teams implement circuit breakers that monitor the error rate over sliding windows (for example, 5 errors in the last 10 requests trips the breaker open for 60 seconds). Microsoft Azure customers layer multiple safeguards: per-message-type quotas, dependency-specific rate limits, and global throughput caps with a manual override requiring manager approval during incident response.
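A minimal circuit breaker along those lines, assuming a hypothetical `call_dependency` function; the window, threshold, and cooldown match the example figures above:

```python
import time
from collections import deque

class CircuitBreaker:
    """Opens after `error_threshold` failures within the last `window` calls; stays open for `cooldown` seconds."""
    def __init__(self, window: int = 10, error_threshold: int = 5, cooldown: float = 60.0):
        self.results = deque(maxlen=window)   # True = success, False = error
        self.error_threshold = error_threshold
        self.cooldown = cooldown
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None             # half-open: let traffic probe again
            self.results.clear()
            return True
        return False

    def record(self, success: bool) -> None:
        self.results.append(success)
        if self.results.count(False) >= self.error_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def redrive_with_breaker(message, call_dependency):
    if not breaker.allow():
        raise RuntimeError("Redrive paused: downstream circuit open")
    try:
        call_dependency(message)
        breaker.record(True)
    except Exception:
        breaker.record(False)
        raise
```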
💡 Key Takeaways
•Cap redrive throughput to 1 to 5 percent of normal traffic volume using token bucket or leaky bucket rate limiting to avoid overwhelming dependencies
•Require canary batch verification with 100 to 1,000 messages achieving greater than 95 percent success rate before enabling full redrive
•Google Pub/Sub uses dedicated rate-limited subscriptions for redrive, isolating reprocessing from normal flow and preserving multi-tenant consumer isolation
•Idempotency keys at business operation level (order ID, payment ID) with bounded 7- to 30-day retention prevent duplicate side effects from at-least-once delivery
•Adaptive rate limiting monitors downstream error rates during redrive, automatically throttling or pausing when errors exceed thresholds (circuit breaker pattern)
•Amazon Prime Day preparation includes raising circuit breaker thresholds and widening backoff caps from 30 seconds to 2 or 5 minutes to handle 10 to 50 times baseline load
📌 Examples
After fixing a schema validation bug, a team redrives 50,000 DLQ messages at 500 messages per second (5 percent of the normal 10,000 messages per second throughput) with a circuit breaker monitoring the downstream database error rate
E-commerce order processing uses order identifiers as idempotency keys stored in Redis with a 14-day TTL; redrive after a payment gateway fix skips 20 percent of messages already partially processed
Microsoft Azure customer implements three-layer rate limiting: 100 messages per second per message type, 500 total across types, and a 1,000 global cap requiring VP approval to override during an incident (see the sketch below)
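One way to sketch that layered check, using the quotas from the example above. The per-second windows and the in-process shared global counter are assumptions for illustration, not a documented Azure Service Bus feature; a real deployment would keep the global counter in a shared store.

```python
import time
from collections import defaultdict

PER_TYPE_LIMIT = 100     # layer 1: each message type, per second
TOTAL_LIMIT = 500        # layer 2: this redrive job, across all types
GLOBAL_CAP = 1_000       # layer 3: all redrive jobs combined; raised only by manual override

class LayeredLimiter:
    # Shared across all limiter instances in this process as a stand-in for a shared store.
    global_count = 0
    global_window = time.monotonic()

    def __init__(self):
        self.window_start = time.monotonic()
        self.per_type = defaultdict(int)
        self.total = 0

    def _roll_windows(self) -> None:
        now = time.monotonic()
        if now - self.window_start >= 1.0:        # reset local counters each second
            self.window_start, self.total = now, 0
            self.per_type.clear()
        if now - LayeredLimiter.global_window >= 1.0:
            LayeredLimiter.global_window, LayeredLimiter.global_count = now, 0

    def allow(self, message_type: str) -> bool:
        self._roll_windows()
        if self.per_type[message_type] >= PER_TYPE_LIMIT:
            return False                          # layer 1 exceeded
        if self.total >= TOTAL_LIMIT:
            return False                          # layer 2 exceeded
        if LayeredLimiter.global_count >= GLOBAL_CAP:
            return False                          # layer 3 exceeded
        self.per_type[message_type] += 1
        self.total += 1
        LayeredLimiter.global_count += 1
        return True
```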