Message Queues & Streaming • Dead Letter Queues & Error Handling
Transient vs Permanent Failures: Error Classification Strategy
The foundation of an effective DLQ strategy is distinguishing transient failures that warrant retry from permanent failures that should be dead-lettered immediately. Transient failures include network timeouts, HTTP 5xx server errors, brief dependency brownouts, and rate limit responses (HTTP 429). These resolve with time and should be retried with exponential backoff. Permanent failures include schema validation errors, missing required fields, authorization failures (HTTP 401 or 403), and malformed payloads that will never succeed regardless of retry count.
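To make this concrete, the sketch below shows one way a consumer might classify failures before deciding whether to retry or dead letter. The exception types and status code sets are illustrative assumptions, not any particular broker's API.

```python
from enum import Enum

class FailureClass(Enum):
    TRANSIENT = "transient"   # retry with backoff
    PERMANENT = "permanent"   # dead letter immediately

# Hypothetical exception types standing in for whatever the consumer raises.
class SchemaValidationError(Exception): pass
class AuthorizationError(Exception): pass

TRANSIENT_HTTP = {429, 500, 502, 503, 504}   # rate limits and server errors
PERMANENT_HTTP = {400, 401, 403, 422}        # bad payloads and auth failures

def classify_status(status_code: int) -> FailureClass:
    """Map an HTTP status from a downstream dependency to a failure class."""
    if status_code in PERMANENT_HTTP:
        return FailureClass.PERMANENT
    if status_code in TRANSIENT_HTTP or status_code >= 500:
        return FailureClass.TRANSIENT
    # Unknown codes default to transient; the attempt-count cap still
    # dead-letters the message if retries never succeed.
    return FailureClass.TRANSIENT

def classify_exception(exc: Exception) -> FailureClass:
    """Map a consumer-side exception to a failure class."""
    if isinstance(exc, (SchemaValidationError, AuthorizationError)):
        return FailureClass.PERMANENT
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureClass.TRANSIENT
    return FailureClass.TRANSIENT
```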
Retry policies for transient failures typically use exponential backoff with full jitter to avoid thundering herd problems. A common pattern starts at 100 milliseconds, doubles each attempt (200 ms, 400 ms, 800 ms), and caps at 30 to 60 seconds for hot path processing or 15 to 30 minutes for batch workloads. After reaching maximum attempts (commonly 3 to 10 depending on idempotency guarantees and cost of side effects), the message moves to the DLQ even if the failure appears transient.
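A minimal sketch of that policy, assuming a hot-path cap of 60 seconds and 5 maximum attempts (both tunable within the ranges above). "Full jitter" here means sleeping a uniform random amount between zero and the capped exponential delay.

```python
import random

BASE_DELAY_S = 0.1    # 100 ms first delay
MAX_DELAY_S = 60.0    # hot-path cap; batch workloads might cap at 15 to 30 minutes
MAX_ATTEMPTS = 5      # tune between 3 and 10 based on idempotency and side-effect cost

def backoff_delay(attempt: int) -> float:
    """Delay before retry `attempt` (1-based): exponential growth with full
    jitter, i.e. a uniform random value between 0 and the capped backoff."""
    capped = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** (attempt - 1)))
    return random.uniform(0.0, capped)

def should_dead_letter(attempt: int) -> bool:
    """After the final allowed attempt, route to the DLQ even if the
    failure still looks transient."""
    return attempt >= MAX_ATTEMPTS
```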
Permanent failures should skip retries entirely to avoid wasting resources and delaying forensics. When a consumer detects invalid schema, it should immediately route to DLQ with error classification metadata. Microsoft Azure implementations allow application code to explicitly dead letter messages, enabling smart classification at the business logic layer rather than relying solely on attempt count thresholds.
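Combining the two ideas, a consumer loop might look roughly like the sketch below, reusing classify_exception and FailureClass from the earlier sketch. The `receiver` interface (complete/abandon/dead_letter) and the `process` function are assumptions standing in for a real broker client; Azure Service Bus, for example, exposes an explicit dead-letter operation on its receiver that this mirrors.

```python
def handle_message(message, receiver):
    """One delivery attempt: retry transients via redelivery, dead letter
    permanent failures immediately with classification metadata."""
    try:
        process(message)              # business logic, assumed defined elsewhere
        receiver.complete(message)    # acknowledge success
    except Exception as exc:
        if classify_exception(exc) is FailureClass.PERMANENT:
            # Skip retries entirely and attach metadata for later forensics.
            receiver.dead_letter(
                message,
                reason=type(exc).__name__,
                description=str(exc)[:512],
            )
        else:
            # Transient: release the message so the broker redelivers it;
            # the backoff policy and attempt cap decide when it finally
            # lands in the DLQ.
            receiver.abandon(message)
```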
The challenge is classification ambiguity. A dependency returning HTTP 500 might indicate transient overload or a permanent bug triggered by specific payload content. Production systems address this with error fingerprinting: hashing error message, stack trace, and request characteristics to detect patterns. If the same fingerprint appears across multiple messages, it likely indicates a permanent issue requiring code fix rather than retry.
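A sketch of such a fingerprint is below; the volatile tokens it normalizes away (numbers, hex addresses, line numbers) and the choice of the last three stack frames are assumptions that real systems tune.

```python
import hashlib
import re
import traceback

def error_fingerprint(exc: Exception, payload_kind: str) -> str:
    """Hash stable characteristics of a failure so repeated occurrences of the
    same underlying bug collapse to one fingerprint."""
    # Normalize volatile tokens (ids, timestamps, addresses) out of the message.
    message = re.sub(r"\b\d+\b|0x[0-9a-fA-F]+", "<n>", str(exc))
    # Keep only the innermost frames and strip line numbers that shift per deploy.
    frames = "".join(traceback.format_tb(exc.__traceback__)[-3:])
    frames = re.sub(r"line \d+", "line <n>", frames)
    raw = "|".join([type(exc).__name__, message, frames, payload_kind])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# If the same fingerprint shows up across many distinct messages, treat it as a
# likely permanent bug: alert and stop retrying rather than burning attempts.
```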
💡 Key Takeaways
•Transient failures (timeouts, 5xx, rate limits) should retry with exponential backoff starting at 100 ms and capping at 30 to 60 seconds, with 3 to 10 max attempts
•Permanent failures (schema errors, missing fields, authorization failures) should skip retries and move immediately to DLQ to avoid wasting resources
•Full jitter on backoff intervals prevents thundering herd when many consumers retry simultaneously after a shared dependency recovers
•Error fingerprinting hashes error messages, stack traces, and payload characteristics to detect patterns indicating permanent bugs versus transient issues
•Microsoft Azure allows application code to explicitly dead letter messages, enabling business logic layer classification beyond simple attempt count thresholds
•Ambiguous HTTP 500 errors require fingerprint analysis: the same fingerprint across multiple messages likely indicates a permanent bug that needs a code fix, not more retries
📌 Examples
A payment service classifies a database connection timeout as transient with 5 retry attempts, but an invalid credit card format as permanent with immediate DLQ routing and alerting
During an AWS region brownout, message processing latency spikes from 50 ms to 2 seconds; exponential backoff with jitter spreads retry load over 30 seconds, preventing further overload
Schema evolution at Google introduces a new required field; old consumers lacking validation logic for the field see a 100 percent DLQ rate until a backward-compatible schema is deployed