Design Fundamentals • Availability & ReliabilityEasy⏱️ ~2 min
Availability Fundamentals: Nines, MTBF, MTTR, and Composition
Availability measures the fraction of time a system is operational and serving requests, computed as Uptime / (Uptime + Downtime) or equivalently MTBF / (MTBF + MTTR). The industry expresses availability in nines: 99.9% (three nines) means roughly 43.2 minutes of downtime per month, 99.99% (four nines) allows only 4.3 minutes monthly, and 99.999% (five nines) permits just 26 seconds. For example, if your MTBF is 1000 hours and MTTR is 1 hour, availability is approximately 1000 / 1001 or 99.9%. In production, SLIs capture availability by tracking the percentage of requests that succeed and meet latency thresholds, such as 99% of requests completing under 300 ms with non empty results.
A critical trap is that availability composes multiplicatively in series: if your service depends on five independent backend services each at 99.9%, the upper bound of combined availability is roughly 0.999 raised to the fifth power, yielding approximately 99.5% even before network, client, or deployment failures enter the picture. This means every dependency erodes your availability budget. Gmail publishes a 99.9% monthly SLA, which translates to roughly 43.2 minutes of allowable downtime. Amazon S3 guarantees 99.99% availability for its Standard storage class, permitting at most 52.6 minutes of downtime per year. When designing for high availability, always map your nines target to concrete downtime minutes and validate that your dependency chain can realistically achieve it.
💡 Key Takeaways
•Availability equals Uptime divided by total time or MTBF divided by the sum of MTBF and MTTR. An MTBF of 1000 hours and MTTR of 1 hour yields 99.9% availability.
•Each nine corresponds to specific downtime: 99.9% allows 43.2 minutes per month, 99.99% allows 4.3 minutes, and 99.999% allows only 26 seconds.
•Availability composes poorly in series. Five dependencies at 99.9% each yield approximately 99.5% combined availability, illustrating how dependencies erode your budget.
•Real SLIs measure both success rate and latency, such as 99% of requests completing under 300 ms, reflecting actual user experience beyond simple uptime.
•Gmail guarantees 99.9% monthly (43.2 min downtime), S3 guarantees 99.99% yearly (52.6 min), and Google Cloud Spanner offers 99.999% multi region (5.26 min yearly).
•Calculate the business cost of downtime before committing to an extra nine. Moving from 99.9% to 99.99% saves roughly 8 hours yearly but may double infrastructure costs.
📌 Examples
A payment service at 99.95% depends on an auth service at 99.9% and a fraud detection service at 99.9%. The best case combined availability is 0.9995 × 0.999 × 0.999 ≈ 99.75%, worse than any single component.
Netflix runs active active across multiple AWS regions to tolerate full region failures. Their architecture ensures that even if one region goes down completely, users experience no downtime.
Amazon S3 Standard class promises 99.99% availability (52.6 min/year max downtime) and 11 nines durability (99.999999999%), meaning loss of one object is expected no more than once per 10^11 object years.