TLS Failure Modes: Revocation, Zero RTT Replay, and Operational Pitfalls
Certificate revocation is one of the most brittle aspects of TLS at Internet scale. In theory, when a certificate is compromised, the issuing CA adds it to a Certificate Revocation List (CRL) or reports it as revoked via the Online Certificate Status Protocol (OCSP), and clients check revocation status before trusting the certificate. In practice, CRLs can grow to megabytes and propagate slowly, while OCSP queries add latency and usually fail open: a client that cannot reach the OCSP responder proceeds anyway, so that an outage in OCSP infrastructure does not break connectivity. OCSP stapling, where the server fetches its own OCSP response and includes it in the handshake, removes the extra client round trip, but stapling success rates vary and clients typically do not hard fail when no staple is present. The operational reality is that revocation cannot be relied upon for a timely response to compromise. Instead, production systems emphasize short certificate lifetimes (90 days or less) and rapid rotation, accepting that a compromised key might remain usable until natural expiration but limiting that window to weeks instead of months or years.
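The short-lifetime strategy depends on renewing well before expiry. A minimal sketch of such a check, assuming the expiry string has the format Python's `ssl` module reports in the `notAfter` field of `getpeercert()`; the function names and the 30 day renewal threshold are illustrative assumptions, not values from any particular CA or tool:

```python
import ssl
from datetime import datetime, timezone


def days_remaining(not_after: str, now: datetime) -> float:
    """Days until a certificate expires.

    `not_after` is assumed to look like "Mar 31 00:00:00 2025 GMT",
    the format used by the 'notAfter' field of getpeercert().
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_rotation(not_after: str, now: datetime, renew_before_days: float = 30) -> bool:
    """True once the certificate is inside the (hypothetical) renewal window."""
    return days_remaining(not_after, now) <= renew_before_days


if __name__ == "__main__":
    now = datetime(2025, 3, 1, tzinfo=timezone.utc)
    print(needs_rotation("Mar 31 00:00:00 2025 GMT", now))
```

Passing `now` explicitly keeps the check testable and makes it obvious that the result depends on a correct clock.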
Zero RTT early data in TLS 1.3 offers dramatic performance benefits by allowing clients to send application data in the very first flight of the handshake, so a resumed connection does not wait for a handshake round trip before its first request. However, this data is fundamentally replayable: an attacker who captures the initial client message can resend it to the same server or to different servers, and TLS by itself does not guarantee that the replay is detected, especially across a fleet of servers that share no state. This restricts zero RTT to idempotent operations. A GET request for a static asset is safe; a POST request that creates a database record or a financial transaction is not, because a replay could duplicate the action. Many production systems disable zero RTT entirely for API endpoints and enable it only for static content served from CDNs. Even then, the segregation must be enforced at the application layer, because TLS itself cannot distinguish safe from unsafe early data. Systems that do enable zero RTT for APIs must implement application-level replay protection, such as single-use tokens or timestamps with strict bounds, which adds complexity.
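The single-use token and bounded-timestamp idea can be sketched as follows. This is an illustration under stated assumptions, not a production design: the `ReplayGuard` class and its parameters are hypothetical, the token and timestamp are assumed to be integrity protected by the application (for example, covered by a request signature), and the in-memory store only works for a single process, whereas a real fleet would need a shared store.

```python
import time
from typing import Callable, Dict


class ReplayGuard:
    """Rejects requests whose token was already seen or whose timestamp is stale."""

    def __init__(self, max_age_seconds: float = 10.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # token -> timestamp it carried

    def accept(self, token: str, issued_at: float) -> bool:
        now = self._clock()
        # Forget tokens whose timestamps have aged out; a replay of them
        # would be rejected by the freshness check below anyway.
        self._seen = {t: ts for t, ts in self._seen.items()
                      if now - ts <= self._max_age}
        if abs(now - issued_at) > self._max_age:
            return False  # outside the strict time bound
        if token in self._seen:
            return False  # token already used once
        self._seen[token] = issued_at
        return True


if __name__ == "__main__":
    guard = ReplayGuard(max_age_seconds=10.0)
    t = time.time()
    print(guard.accept("req-123", t))  # first use
    print(guard.accept("req-123", t))  # replay of the same request
```

The time bound keeps the set of remembered tokens small: anything older than the window is rejected on freshness alone, so it does not need to be stored.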
Operational misconfigurations cause subtle but impactful failures. Session ticket key mismanagement is common: if ticket encryption keys are not rotated frequently (every few hours to one day), forward secrecy degrades, because an attacker who steals an old ticket key can decrypt the sessions tied to tickets encrypted with that key. This is most severe in TLS 1.2, where the ticket carries the session's master secret. If ticket keys are not synchronized across a load-balanced fleet, returning users randomly hit servers that cannot decrypt their tickets, so resumption fails and full handshakes occur, spiking latency and CPU usage. Certificate chain issues also create hard-to-diagnose failures: serving an incomplete chain (missing intermediate certificates) causes some clients to fail validation while others succeed, depending on whether they have cached the intermediate. Large handshake flights, especially with large RSA certificate chains or excessive TLS extensions, span several packets; on congested or lossy links this raises the chance of packet loss and manifests as intermittent connection failures that are difficult to reproduce in testing but common on mobile networks. Clock skew on clients or servers causes "certificate not yet valid" or "certificate expired" errors that appear as authentication failures but are actually time synchronization problems.
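The rotation policy for ticket keys can be pictured as a small key ring: new tickets are always encrypted under the newest key, a bounded number of older keys are kept so recently issued tickets still resume, and anything older is discarded. The sketch below is a hypothetical illustration (the `TicketKeyRing` class, key sizes, and `keep_previous` parameter are assumptions, not any TLS library's API), and it is per process; in a fleet, the same ring would have to be distributed to every server.

```python
import os
from collections import deque
from typing import Deque, Optional, Tuple


class TicketKeyRing:
    """Newest key encrypts new tickets; a few older keys remain for decryption."""

    def __init__(self, keep_previous: int = 2) -> None:
        # maxlen makes the deque drop the oldest key automatically.
        self._keys: Deque[Tuple[bytes, bytes]] = deque(maxlen=keep_previous + 1)
        self.rotate()

    def rotate(self) -> bytes:
        """Generate a fresh random key; call this on a timer (e.g. every few hours)."""
        key_name, key = os.urandom(16), os.urandom(32)
        self._keys.appendleft((key_name, key))
        return key_name

    def current(self) -> Tuple[bytes, bytes]:
        """(key_name, key) to use when issuing new tickets."""
        return self._keys[0]

    def lookup(self, key_name: bytes) -> Optional[bytes]:
        """Key for an incoming ticket, or None, which forces a full handshake."""
        for name, key in self._keys:
            if name == key_name:
                return key
        return None


if __name__ == "__main__":
    ring = TicketKeyRing(keep_previous=1)
    old_name, _ = ring.current()
    ring.rotate()
    print(ring.lookup(old_name) is not None)  # still accepted for resumption
    ring.rotate()
    print(ring.lookup(old_name) is None)      # discarded after the second rotation
```

Discarding old keys rather than archiving them is the point of the exercise: a key that no longer exists cannot be stolen later to decrypt captured sessions.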
💡 Key Takeaways
• OCSP queries add 50 to 200 ms of latency and fail open in most browsers, meaning revocation cannot be relied upon for timely compromise response; short 90 day lifetimes limit exposure to weeks instead of months or years
• Zero RTT early data is replayable at the TLS layer; it is safe only for idempotent operations like GET requests for static content, never for writes or state changes without application-level replay protection
• Session ticket keys not rotated within hours to 1 day undermine forward secrecy; a leaked key allows decryption of past sessions whose tickets were encrypted with that key
• Incomplete certificate chains (missing intermediates) cause validation failures in 10 to 30 percent of clients depending on their cached intermediate certificates, creating intermittent and hard-to-diagnose issues
• Large certificate chains (4 to 6 KB for RSA) exceed the typical 1500 byte MTU and span 3 to 4 packets, increasing failure rates by 5 to 15 percent on lossy mobile networks compared to 2 to 3 KB ECDSA chains in 2 packets
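The packet-count argument in the last takeaway can be checked with a little arithmetic. A rough sketch, assuming a 1460 byte TCP payload per packet (1500 byte MTU minus 40 bytes of IPv4 and TCP headers, no options), independent per-packet loss, and counting only the certificate chain rather than the whole handshake flight; the function names and the 2 percent loss rate are illustrative:

```python
import math


def segments_needed(chain_bytes: int, mss: int = 1460) -> int:
    """Number of full-size packets required to carry the certificate chain."""
    return math.ceil(chain_bytes / mss)


def flight_loss_probability(n_segments: int, per_packet_loss: float) -> float:
    """Chance that at least one packet of the flight is lost (independent losses)."""
    return 1.0 - (1.0 - per_packet_loss) ** n_segments


if __name__ == "__main__":
    for label, size in (("RSA chain, ~4.5 KB", 4500), ("ECDSA chain, ~2.5 KB", 2500)):
        n = segments_needed(size)
        p = flight_loss_probability(n, per_packet_loss=0.02)
        print(f"{label}: {n} packets, P(at least one lost) = {p:.3f}")
```

Under these assumptions a larger chain both needs more packets and is more likely to lose one of them, and a lost handshake packet costs a retransmission timeout before the connection can proceed.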
📌 Examples
Uber experienced a multi-hour outage when a certificate expired and OCSP stapling failed simultaneously, demonstrating that revocation and expiry handling must be robust even when multiple systems fail
A major CDN disabled zero RTT on all API routes after discovering that replayed POST requests were duplicating customer transactions, limiting zero RTT to static asset domains only
Amazon's s2n-tls library emphasizes constant-time code paths and a minimal attack surface, responding to the reality that TLS implementations are frequent targets, with dozens of CVEs across OpenSSL, GnuTLS, and other libraries