
Retry Policies: Exponential Backoff, Jitter, and Budgets

Retries are a control loop over transient failures such as timeouts, transport errors, and overload conditions that typically resolve within milliseconds to seconds. A robust retry policy requires five components: error classification to determine what to retry, bounded attempts to prevent infinite loops, exponential backoff to space out retries and reduce load, jitter to avoid synchronized retry storms, and a budgeted deadline so total client-observed latency stays within Service Level Objectives (SLOs).

Error classification is critical: retry connection resets, timeouts, throttling responses, and most 5xx server errors, but never retry validation errors, authentication failures, or other 4xx client errors that indicate a problem with the request itself rather than a transient condition.

Exponential backoff with full jitter is the industry standard for spacing retries. Start with a base delay such as 100 milliseconds, double it for each subsequent retry, cap it at 2 to 10 seconds, and randomize each delay uniformly between 0 and the computed backoff for that attempt. For example, the first retry waits a random duration between 0 and 100 milliseconds, the second between 0 and 200 milliseconds, and the third between 0 and 400 milliseconds. AWS published research showing that full jitter significantly reduces synchronized retry storms during regional events compared to exponential backoff without jitter. With a 100 millisecond base and full jitter, the expected additional wait is about 50 milliseconds before the second attempt and about 150 milliseconds cumulatively before the third (100 and 300 milliseconds in the worst case), and those delays must fit within your end-to-end SLO budget.

Retry budgets prevent load amplification during degraded conditions. Set a per-request deadline and propagate it downstream; abandon retries when the remaining time cannot accommodate another attempt. Track a retry budget at the client, for example allowing retries to consume at most 10 percent additional requests per second, to prevent overload during incidents. If your 99th percentile SLO is 300 milliseconds and median processing time is 50 milliseconds, you might allow one retry with a timeout of roughly 100 to 150 milliseconds, but not three retries that would push latency beyond the SLO. Pair retries with circuit breakers to fail fast on persistent faults and with token-bucket or leaky-bucket rate limiting to smooth bursts. AWS Software Development Kits (SDKs) implement exponential backoff with jitter for retryable error categories, defaulting to single-digit attempt counts with backoff caps of seconds to tens of seconds.
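A minimal sketch of this loop in Python, assuming stand-in exception classes and parameter values chosen to match the figures above; the function name call_with_retries and the RETRYABLE tuple are illustrative, not a particular library's API:

```python
import random
import time

# Stand-in exception types for a client library's retryable failures (illustrative).
class ThrottledError(Exception):
    pass

class ServerError(Exception):
    pass

# Error classification: anything not listed here (validation errors, auth
# failures, other 4xx-style client errors) is raised immediately, never retried.
RETRYABLE = (TimeoutError, ConnectionResetError, ThrottledError, ServerError)

def call_with_retries(operation, max_attempts=3, base_s=0.1, cap_s=2.0, deadline_s=0.3):
    """Capped exponential backoff with full jitter and a total-latency deadline."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return operation()
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise  # Bounded attempts: give up rather than loop forever.
            # Full jitter: sleep a uniform random time in [0, min(cap, base * 2^attempt)].
            sleep_s = random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
            remaining_s = deadline_s - (time.monotonic() - start)
            if sleep_s >= remaining_s:
                raise  # Budgeted deadline: no room left for another attempt.
            time.sleep(sleep_s)
```

With these defaults the first retry waits between 0 and 100 milliseconds and the second between 0 and 200 milliseconds, matching the worked numbers above.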
💡 Key Takeaways
Retry only transient errors such as timeouts, connection resets, throttling, and 5xx responses; never retry validation, authentication, or 4xx errors that indicate a problem with the request.
Exponential backoff with full jitter, randomizing delay between 0 and the backoff cap, prevents synchronized retry storms that can overwhelm recovering services during regional outages.
With a 100 millisecond base and full jitter, expected additional wait is about 50 milliseconds before the second attempt and about 150 milliseconds cumulatively before the third (100 and 300 milliseconds in the worst case), which must fit within SLO budgets.
Retry budgets limit retries to a percentage of additional load, such as 10 percent more requests per second, preventing retry amplification from turning transient failures into cascading overload (a minimal sketch follows this list).
Set a per request deadline and abandon retries when remaining time cannot accommodate another round trip; propagate deadlines downstream to prevent deep retry chains.
AWS SDKs default to single digit retry attempts capped at seconds to tens of seconds, and AWS research demonstrated full jitter significantly reduces retry collisions compared to plain exponential backoff.
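The retry-budget takeaway can be made concrete with a small counter that admits retries only while they stay under a fixed fraction of primary traffic; the class name RetryBudget and the one-second accounting window are assumptions for illustration, not a specific library's behavior:

```python
import threading
import time

class RetryBudget:
    """Admit retries only while they stay under `ratio` of primary requests,
    counted over a fixed one-second window (illustrative accounting)."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self._lock = threading.Lock()
        self._window_start = time.monotonic()
        self._requests = 0
        self._retries = 0

    def record_request(self):
        """Call once per primary (non-retry) request."""
        with self._lock:
            self._maybe_reset_window()
            self._requests += 1

    def allow_retry(self):
        """True while retries remain under ratio * primary requests; otherwise fail fast."""
        with self._lock:
            self._maybe_reset_window()
            if self._retries < self.ratio * self._requests:
                self._retries += 1
                return True
            return False  # Budget exhausted: surface the error instead of amplifying load.

    def _maybe_reset_window(self):
        now = time.monotonic()
        if now - self._window_start >= 1.0:
            self._window_start = now
            self._requests = 0
            self._retries = 0
```

At a base load of 10,000 requests per second with ratio 0.10, this admits at most roughly 1,000 retries per second, the figure used in the examples below.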
📌 Examples
A payment service with 300 millisecond p99 SLO and 50 millisecond median latency allows one retry at 100 to 150 millisecond timeout, giving the retry approximately 150 milliseconds to complete within the 300 millisecond budget.
An AWS SDK retrying a DynamoDB request uses full jitter with 100 millisecond base: first retry waits 0 to 100 milliseconds (average 50), second waits 0 to 200 milliseconds (average 100), third waits 0 to 400 milliseconds (average 200), capped at 3 attempts.
A service tracking a 10 percent retry budget at 10,000 requests per second base load allows at most 1,000 additional retry requests per second; when retry rate exceeds this, circuit breakers trip to fail fast.
An internal RPC sets a 500 millisecond deadline and propagates it in request metadata. After the first 200 millisecond attempt times out and a 100 millisecond backoff, only 200 milliseconds remain, insufficient for another 200 millisecond attempt, so the retry is abandoned (see the sketch below).
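A minimal sketch of the remaining-time check from the RPC example above; the function names and the fixed 200 millisecond attempt timeout are illustrative assumptions:

```python
import time

def remaining_ms(deadline_ms, start_monotonic):
    """Milliseconds left before the propagated per-request deadline."""
    return deadline_ms - (time.monotonic() - start_monotonic) * 1000.0

def worth_retrying(deadline_ms, start_monotonic, backoff_ms, attempt_timeout_ms):
    """Retry only if the pending backoff plus a full attempt timeout still fit."""
    return backoff_ms + attempt_timeout_ms <= remaining_ms(deadline_ms, start_monotonic)

# With the numbers from the example: a 500 ms deadline, 200 ms spent on the
# failed first attempt, a 100 ms backoff pending, and a 200 ms timeout for the
# next attempt, only 300 ms remain and 100 + 200 ms leaves no headroom, so the
# caller abandons the retry rather than risk overrunning the deadline.
```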