
DLQ Operations: Metrics, Alerting, and Triage SLOs

Core Observability Metrics: Treat DLQs as first-class production components with dedicated metrics: DLQ depth (total messages), arrival rate (msg/sec entering), drain rate (msg/sec leaving via redrive or expiration), oldest message age, and error category distribution. Amazon teams build dashboards with stacked area charts showing arrival vs drain rates.
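The core metrics above can be sketched as a simple snapshot type; the field and method names here are illustrative assumptions, not tied to any specific queue product's API:

```python
from dataclasses import dataclass

# Hypothetical snapshot of DLQ observability metrics over one measurement window.
@dataclass
class DlqSnapshot:
    depth: int                 # total messages currently in the DLQ
    arrived: int               # messages that entered during the window
    drained: int               # messages redriven or expired during the window
    window_seconds: float      # measurement window length
    oldest_enqueued_at: float  # epoch seconds of the oldest message

    def arrival_rate(self) -> float:
        """Messages per second entering the DLQ."""
        return self.arrived / self.window_seconds

    def drain_rate(self) -> float:
        """Messages per second leaving the DLQ (redrive or expiration)."""
        return self.drained / self.window_seconds

    def oldest_age(self, now: float) -> float:
        """Age in seconds of the oldest message still in the DLQ."""
        return now - self.oldest_enqueued_at

snap = DlqSnapshot(depth=120, arrived=30, drained=6,
                   window_seconds=60.0, oldest_enqueued_at=1_000.0)
print(snap.arrival_rate())        # 0.5 msg/sec
print(snap.drain_rate())          # 0.1 msg/sec
print(snap.oldest_age(1_450.0))   # 450.0 seconds
```

Plotting arrival rate against drain rate over time is what produces the stacked area view: a sustained gap between the two is exactly the backlog growth the dashboard is meant to reveal.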
Alerting Thresholds
Rate: 0.1-1% of main throughput | Age: 5× SLO | Fingerprints: >10 distinct in 15 min
DLQ arrival sustained above 0.1-1% of main-queue throughput for 5-10 minutes signals a systemic failure, not an isolated bad message. Oldest message exceeding 5× the end-to-end SLO pages the on-call.
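The rate and age thresholds above can be expressed as a small alert-evaluation function; the function name and default limits are assumptions for illustration, not a vendor alerting API:

```python
# Illustrative alert checks: arrival-rate ratio and oldest-message age.
def should_page(main_rate: float, dlq_rate: float,
                sustained_minutes: float,
                oldest_age_s: float, slo_s: float,
                ratio_limit: float = 0.01,
                sustain_limit: float = 5.0,
                age_multiple: float = 5.0) -> list[str]:
    """Return the list of alert reasons that currently apply."""
    reasons = []
    # Rate alert: DLQ intake at or above ~1% of main-queue throughput,
    # sustained for 5+ minutes, suggests a systemic failure.
    if (main_rate > 0
            and dlq_rate / main_rate >= ratio_limit
            and sustained_minutes >= sustain_limit):
        reasons.append("dlq_arrival_ratio")
    # Age alert: oldest message older than 5x the end-to-end SLO.
    if oldest_age_s >= age_multiple * slo_s:
        reasons.append("oldest_message_age")
    return reasons

# 30-second payment SLO: a 150-second-old DLQ message crosses the 5x line.
print(should_page(main_rate=1000, dlq_rate=15, sustained_minutes=8,
                  oldest_age_s=150, slo_s=30))
# → ['dlq_arrival_ratio', 'oldest_message_age']
```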
Triage SLOs: Microsoft Azure enterprise customers commit to: initial triage within 15-30 minutes (human eyes on the problem, error classification documented), root cause identification within 2-4 hours, and complete DLQ drain within 24 hours for non-critical paths or 4 hours for critical revenue flows.
Required Metadata: Each DLQ message should include: original timestamp, attempt counter, consumer version, error category and message, dependency response snapshots, correlation identifiers, and trace context for distributed tracing integration.
Dashboard Design: Surface actionable insights: error category distribution as stacked areas, p50/p99 message age, redrive success rate trending, and top-N error fingerprints with sample message IDs. Amazon on-call playbooks include DLQ-specific runbooks with decision trees: schema errors → data platform team, auth failures → security team, dependency timeouts → capacity review.
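A minimal sketch of the metadata envelope and the runbook decision tree described above; the field names, version string, and team names are illustrative placeholders, not a standard schema:

```python
import time
import uuid

# Hypothetical DLQ message envelope carrying the required triage metadata.
def build_dlq_envelope(payload: dict, attempt: int, error_category: str,
                       error_message: str, trace_id: str) -> dict:
    return {
        "original_timestamp": time.time(),   # when the message first failed
        "attempt_count": attempt,            # retries before dead-lettering
        "consumer_version": "v2.3.1",        # illustrative build stamp
        "error_category": error_category,
        "error_message": error_message,
        "correlation_id": str(uuid.uuid4()), # joins this failure to its request
        "trace_id": trace_id,                # distributed-tracing context
        "payload": payload,
    }

# Runbook decision tree: dominant error category -> owning team.
ROUTING = {
    "schema_error": "data-platform-team",
    "auth_failure": "security-team",
    "dependency_timeout": "capacity-review",
}

env = build_dlq_envelope({"order_id": 42}, attempt=3,
                         error_category="schema_error",
                         error_message="missing field 'sku'",
                         trace_id="abc123")
print(ROUTING.get(env["error_category"], "service-oncall"))  # data-platform-team
```

The default owner (`service-oncall` here) matters: any category not covered by the decision tree should still page someone rather than sit unrouted.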
💡 Key Takeaways
Core DLQ metrics include depth, arrival rate, drain rate, oldest message age, and error category distribution displayed as stacked area charts for trend analysis
Page when DLQ arrival rate exceeds 0.1 to 1 percent of main throughput sustained over 5 to 10 minutes, indicating a systemic failure, not an isolated bad message
Age-based alerts fire when the oldest message exceeds five times the normal end-to-end SLO: a 30-second payment SLO triggers an alert at 150 seconds of DLQ age
Microsoft Azure enterprise SLOs: triage within 15 to 30 minutes, root cause within 2 to 4 hours, complete drain within 24 hours for non-critical paths
Rich metadata per message required: timestamp, attempt count, consumer version, error category, dependency snapshots, correlation IDs, trace context
Google adds error fingerprint cardinality alerts: more than 10 distinct patterns in 15 minutes suggests widespread issue requiring immediate escalation
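The fingerprint-cardinality rule in the last takeaway can be sketched as a sliding-window counter; the class name and implementation details are assumptions, with the thresholds taken from the ">10 distinct patterns in 15 minutes" rule:

```python
from collections import deque

# Sliding-window count of distinct error fingerprints arriving in the DLQ.
class FingerprintCardinalityAlert:
    def __init__(self, window_s: float = 900.0, limit: int = 10):
        self.window_s = window_s   # 15-minute window
        self.limit = limit         # alert above 10 distinct fingerprints
        self.events: deque = deque()  # (timestamp, fingerprint) pairs

    def record(self, now: float, fingerprint: str) -> bool:
        """Record one DLQ arrival; return True if the alert should fire."""
        self.events.append((now, fingerprint))
        # Evict events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        distinct = {fp for _, fp in self.events}
        return len(distinct) > self.limit

alert = FingerprintCardinalityAlert()
fired = False
for i in range(12):  # 12 distinct fingerprints within one window
    fired = alert.record(now=float(i), fingerprint=f"fp-{i}") or fired
print(fired)  # True: more than 10 distinct patterns seen
```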
📌 Interview Tips
1. Payment processing system with a 30-second SLO pages on-call when the oldest DLQ message age hits 150 seconds, indicating a five-times-SLO breach requiring immediate triage
2. Dashboard shows error category shift from 80 percent timeout to 80 percent schema validation over 1 hour, automatically routing the alert to the data platform team per runbook
3. Amazon team implements redrive success rate trending: after deploying a schema fix, redrive achieves 98 percent success on a 5,000-message canary, green-lighting the full drain
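The canary-gated redrive in the third example can be sketched as a simple success-rate gate; the function name and the 95 percent threshold are assumptions for illustration:

```python
# Redrive a sample of DLQ messages first; green-light the full drain only
# if the canary success rate clears the threshold.
def canary_gate(results: list, min_success_rate: float = 0.95) -> bool:
    """results: per-message True/False redrive outcomes from the canary.
    Returns True if the full DLQ drain should proceed."""
    if not results:
        return False  # no evidence yet; do not drain blindly
    success_rate = sum(results) / len(results)
    return success_rate >= min_success_rate

# 5,000-message canary, 98 percent succeed after the schema fix:
canary = [True] * 4900 + [False] * 100
print(canary_gate(canary))  # True → proceed with full drain
```

Gating the drain on a canary keeps a bad fix from burning retry capacity on the entire backlog, and the trending success rate doubles as evidence for the post-incident review.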