SLIs, SLOs, Error Budgets, and the Economics of Availability
Service Level Indicators (SLIs) are quantitative measures of user experience, such as the percentage of requests that succeed and complete within a latency threshold. A robust SLI might be the percentage of search requests that complete under 300 ms and return non-empty results. Service Level Objectives (SLOs) set targets for SLIs over a time window, such as 99.9% monthly availability. A 99.9% monthly SLO translates to an error budget of 0.1%, or roughly 43.2 minutes of allowable downtime per month. This budget can be spent on deployments and experiments, or accepted as risk from dependencies. Error budgets flip the reliability conversation: instead of demanding zero downtime, teams agree on acceptable risk and spend the budget deliberately. Fast burn rate alerts detect acute issues (error rate spikes consuming budget rapidly), while slow burn rate alerts catch chronic degradation (steady elevated errors draining budget over days).
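A minimal sketch of the error budget arithmetic and a burn-rate check; the observed error rates, window choices, and the 15x/3x thresholds below are illustrative assumptions, not prescribed values:

```python
# Illustrative sketch: error budget arithmetic and burn-rate alerting.
# Window sizes and burn-rate thresholds are assumptions, not standards.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def error_budget_minutes(slo: float) -> float:
    """Allowable 'bad' minutes per month for a given SLO, e.g. 0.999."""
    return (1 - slo) * MINUTES_PER_MONTH

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Budget consumption relative to plan; 1.0 spends the whole budget
    exactly at the end of the month."""
    return observed_error_rate / (1 - slo)

SLO = 0.999
print(error_budget_minutes(SLO))  # 43.2

# Fast burn: acute spike over a short window (e.g. the last hour) -> page.
# Slow burn: chronic degradation over a long window (e.g. 3 days) -> ticket.
if burn_rate(observed_error_rate=0.015, slo=SLO) > 14:
    print("page: fast burn, budget gone in roughly 2 days at this rate")
if burn_rate(observed_error_rate=0.003, slo=SLO) > 2:
    print("ticket: slow burn, budget gone in roughly 10 days at this rate")
```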
The economics of availability guide architectural choices. Moving from 99.9% to 99.99% reduces yearly downtime by roughly 8 hours (from 8.76 hours to 52.6 minutes), but often requires doubling capacity to add N+1 redundancy or multi-AZ replication. If an extra nine avoids 8 hours of downtime at $50,000 per hour in lost revenue, it is worth $400,000 per year; if achieving it costs $800,000 in additional infrastructure, it is uneconomical, and the money is better spent on faster incident response, graceful degradation, and automated recovery to minimize user impact during partial outages. Canary deployments and progressive rollouts enforce error budgets: ship to 1% of traffic and block the rollout if the error rate exceeds 0.1% or p99 latency regresses by more than 5% for N consecutive minutes. Load shedding and backpressure apply token buckets or queue limits to reject low-priority traffic when saturation exceeds thresholds, preserving availability for core user flows. Feature flags and kill switches allow remote disabling of non-critical features that consume error budget during incidents.
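A sketch of how such a canary gate might be implemented; the WindowStats shape, the 0.1% error threshold, the 5% p99 regression limit, and the five-minute streak are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """One-minute aggregates for a traffic slice (hypothetical shape)."""
    error_rate: float  # fraction of failed requests, e.g. 0.002
    p99_ms: float      # 99th-percentile latency in milliseconds

def canary_healthy(canary: list[WindowStats],
                   baseline: list[WindowStats],
                   max_error_rate: float = 0.001,     # 0.1% error rate
                   max_p99_regression: float = 0.05,  # 5% p99 regression
                   consecutive_minutes: int = 5) -> bool:
    """Return False (block the rollout) if the canary slice breaches either
    threshold for N consecutive minutes; True lets the rollout proceed."""
    bad_streak = 0
    for c, b in zip(canary, baseline):
        breach = (c.error_rate > max_error_rate
                  or c.p99_ms > b.p99_ms * (1 + max_p99_regression))
        bad_streak = bad_streak + 1 if breach else 0
        if bad_streak >= consecutive_minutes:
            return False
    return True
```

A rollout controller would evaluate this check each minute against fresh canary and baseline aggregates, and halt or roll back the deployment as soon as it returns False.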
💡 Key Takeaways
•SLIs quantify user experience, such as 99% of requests completing under 300 ms with non-empty results, capturing both success rate and latency.
•SLOs set targets for SLIs over time windows. A 99.9% monthly SLO leaves a 0.1% error budget, roughly 43.2 minutes per month, to spend on changes and risk.
•Error budgets flip reliability discussions from zero downtime to acceptable risk, allowing deliberate spending on deployments and experiments within the budget.
•Fast burn rate alerts detect acute issues (rapid error rate spikes), while slow burn rate alerts catch chronic degradation (steady elevated errors over days).
•Economics of availability: moving from 99.9% to 99.99% saves roughly 8 hours yearly but may double infrastructure cost. Calculate revenue impact versus cost before committing.
•Canary deployments enforce error budgets by shipping to 1% of traffic and blocking rollout if error rate exceeds 0.1% or p99 latency regresses by more than 5%.
📌 Examples
Gmail publishes a 99.9% monthly availability SLO, meaning 43.2 minutes of allowable downtime per month. SRE teams use error budgets to balance feature velocity with reliability.
A payment service defines an SLI as the fraction of transactions that complete under 500 ms with successful authorization. The monthly SLO is 99.95%, leaving a 21.6-minute error budget.
An e-commerce checkout flow calculates that each hour of downtime costs $50k in lost revenue. An extra nine (99.9% to 99.99%) saves roughly 8 hours yearly ($400k) but costs $600k in multi-AZ replication, making it uneconomical.
A search service ships a new ranking model to 1% of users. Error rate increases from 0.05% to 0.2%, breaching the 0.1% error budget threshold, triggering automatic rollback.
During a DDoS attack, a load balancer applies token bucket rate limiting and rejects low-priority read requests (autocomplete suggestions) to preserve error budget for high-priority writes (user signups).
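A minimal token-bucket shedder in that spirit; the request classes, rates, and capacities are illustrative assumptions, not recommendations:

```python
import time

class TokenBucket:
    """Admits at most `rate` requests/second on average, with bursts up to
    `capacity`. Parameters are illustrative only."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Shed low-priority traffic first: signups keep a generous bucket,
# autocomplete gets a small one that empties quickly under load.
buckets = {
    "signup": TokenBucket(rate=500, capacity=1000),
    "autocomplete": TokenBucket(rate=50, capacity=100),
}

def admit(request_class: str) -> bool:
    bucket = buckets.get(request_class)
    return bucket.allow() if bucket else False
```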