Reliability vs Availability: Correctness Over Uptime
Reliability measures the probability that a system performs correctly over a time interval, modeled as R(t) = e^(−λt) for a constant failure rate λ. Unlike availability, which only asks whether the system is up, reliability focuses on correctness: no errors, no data corruption, no stale or wrong results. A system can be highly available yet unreliable, always responding but frequently returning incorrect data because of bugs, race conditions, or silent bit rot. Conversely, a system can be highly reliable but not very available, always correct when operational but suffering frequent outages. For distributed systems, reliability encompasses durability (data survives failures), ordering (events are processed in the correct sequence), deduplication (no duplicate side effects), and idempotency (retries do not corrupt state).
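To make the formula concrete, the sketch below evaluates R(t) at a few durations. The function name `reliability` and the failure rate of one failure per 1,000 hours are assumptions chosen for illustration, not figures from any real system.

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of running `hours` without failure, assuming a constant failure rate."""
    if failure_rate_per_hour < 0 or hours < 0:
        raise ValueError("failure rate and duration must be non-negative")
    return math.exp(-failure_rate_per_hour * hours)

# Hypothetical component: on average one failure per 1,000 hours (lambda = 1/1000).
lam = 1 / 1000
for t in (1, 24, 720, 1000):
    print(f"R({t:>4} h) = {reliability(lam, t):.4f}")
```

Under this model, reliability decays with the length of the interval: at t = 1/λ it has already fallen to e^(−1), about 0.368, even though the component may have been "up" the whole time it was measured.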
Amazon S3 illustrates this distinction clearly: S3 Standard is designed for 99.99% availability (uptime) and 99.999999999% (11 nines) durability, meaning stored data remains intact and uncorrupted even if the service occasionally becomes unavailable. Google Cloud Spanner maintains strong consistency via quorum writes and TrueTime, ensuring reliable read results at the cost of tens of milliseconds of added write latency for cross-region coordination and commit-wait periods. Production systems tie availability and reliability together through SLOs and error budgets. A 99.9% monthly availability SLO leaves a 0.1% error budget, about 43.2 minutes in a 30-day month, to spend on deployments, maintenance, or controlled experiments. Mechanisms that improve reliability include checksums for detecting corruption, end-to-end validation to catch application bugs, quorum protocols for consistency, idempotency tokens to prevent duplicate operations, and transactional semantics to enforce invariants.
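The error budget arithmetic can be checked with a few lines of code. This is a minimal sketch that assumes a 30-day window; `error_budget_minutes` is a hypothetical helper name, not part of any SLO tooling.

```python
def error_budget_minutes(slo_percent: float, days_in_window: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window of days."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = days_in_window * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (100 - slo_percent) / 100

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.2f} minutes per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why a 99.99% target leaves only a few minutes per month for deployments and incidents combined.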
💡 Key Takeaways
•Reliability is about correctness and consistency (no errors, no corruption), modeled as R(t) = e^(−λt), while availability is about uptime regardless of correctness.
•A system can be highly available but unreliable (always up, frequently wrong) or highly reliable but not very available (always correct when up, often down).
•Amazon S3 Standard is designed for 99.99% availability and 11 nines of durability (99.999999999%), illustrating that uptime and data integrity are separate targets.
•Google Spanner achieves strong consistency and high reliability via quorum writes and TrueTime, at the cost of tens of milliseconds of added write latency for cross-region coordination.
•Reliability mechanisms include quorum protocols, checksums, end-to-end validation, idempotency tokens, deduplication windows, and transactional semantics to enforce invariants.
•Error budgets tie both together: a 99.9% monthly SLO allows about 43.2 minutes of downtime in a 30-day month, which teams can spend on deployments, maintenance, or controlled experiments.
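Of the mechanisms listed above, idempotency tokens are easy to show in miniature. The following is a rough sketch, not a production design: `PaymentService`, `deposit`, and the token strings are hypothetical names, and the in-memory dictionary stands in for a durable store that a real service would update atomically with the state change.

```python
from dataclasses import dataclass, field

@dataclass
class PaymentService:
    """Toy in-memory service used only to illustrate idempotency tokens."""
    balance: int = 0
    _seen: dict = field(default_factory=dict)  # idempotency token -> stored result

    def deposit(self, token: str, amount: int) -> int:
        # A retry carrying a known token returns the stored result
        # instead of applying the deposit a second time.
        if token in self._seen:
            return self._seen[token]
        self.balance += amount
        self._seen[token] = self.balance
        return self.balance

svc = PaymentService()
svc.deposit("req-1", 100)
svc.deposit("req-1", 100)  # client retry after a timeout: no second deposit
svc.deposit("req-2", 50)
print(svc.balance)
```

Because the retry of `req-1` is recognized and skipped, the client can retry freely for availability without sacrificing correctness.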
📌 Examples
A caching layer that always responds in 5 ms but serves stale data 5% of the time is highly available (always up) but unreliable (frequently incorrect).
Dynamo-style systems prefer availability under partitions, using sloppy quorums and hinted handoff and accepting eventual consistency (weaker correctness guarantees on reads) to remain available during network splits.
A financial ledger system that uses synchronous replication and strict serializability may reject writes during partitions (lower availability) to maintain correctness and prevent double spending (high reliability).
Gmail in Google Workspace publishes a 99.9% availability figure, but also performs background integrity checks and maintains audit logs to ensure message reliability and prevent silent data loss.
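An integrity check of the kind mentioned in these examples can be sketched with a checksum recorded at write time and verified at read time. This is an illustrative sketch using SHA-256 from Python's standard library; the `store` and `verify` helpers and the sample payload are assumptions, and it does not describe how any of the named services implement their checks.

```python
import hashlib

def store(payload: bytes) -> tuple[bytes, str]:
    """Return the payload together with a SHA-256 digest computed at write time."""
    return payload, hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, expected_digest: str) -> bool:
    """True if the payload still matches the digest recorded at write time."""
    return hashlib.sha256(payload).hexdigest() == expected_digest

data, digest = store(b"ledger entry: +100")
corrupted = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit to simulate bit rot

print(verify(data, digest))       # intact copy
print(verify(corrupted, digest))  # silently corrupted copy
```

A system that answers every read but never runs a check like this can stay fully available while serving corrupted data, which is exactly the gap between availability and reliability.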