Caching • Distributed Caching • Easy • ⏱️ ~3 min
What is Distributed Caching and Why Use It?
Distributed caching places a shared in-memory data layer between application servers and backend storage systems such as databases. Instead of a single large cache server, the cache is spread across multiple nodes in a cluster, with each node holding a portion of the total cached data. Keys are partitioned across these nodes using techniques like consistent hashing with virtual nodes, allowing the total memory capacity and request throughput to scale horizontally by adding more cache servers. This architecture turns expensive database queries that might take 10 to 100 milliseconds into fast memory lookups that complete in under 1 millisecond at the median and in low single-digit milliseconds at the 99th percentile.
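To make the partitioning step concrete, here is a minimal sketch of consistent hashing with virtual nodes. It is a Python toy, not a production client library; the node names and the 100-virtual-node count are illustrative assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to cache nodes; virtual nodes smooth out load distribution."""

    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = []              # sorted list of (hash, node) points on the ring
        self._vnodes = vnodes_per_node
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each physical node contributes many points, so keys rebalance evenly
        # when nodes are added or removed.
        for i in range(self._vnodes):
            self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect_left(self._ring, (h,))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42:profile"))  # deterministic, e.g. "cache-2"
```

Adding a fourth node only moves the keys whose ring points now fall before the new node's virtual points, rather than reshuffling the entire keyspace.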
The performance gains are dramatic when hit ratios are high. Moving from a 90% hit rate to a 99% hit rate cuts database traffic by a factor of 10, because the miss rate drops from 10% to 1%. At scale, even small drops matter: Meta's social graph caching layer serves many millions of requests per second, and a 1 to 2 point drop in hit rate translates into hundreds of thousands of extra backend queries per second. Netflix EVCache handles multi-million requests per second with same availability zone (AZ) p50 latency well below 1 ms and p99 around 1 to 2 ms. AWS DynamoDB Accelerator (DAX) advertises microsecond read latencies and reads up to 10 times faster than direct database access, handling millions of requests per second per cluster.
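A quick back-of-the-envelope calculation shows why hit ratio dominates backend load. The 1,000,000 requests per second figure is an assumed example, not a number from any of the systems above.

```python
# Backend queries per second at different cache hit rates,
# assuming a hypothetical 1,000,000 requests/second against the cache tier.
requests_per_second = 1_000_000

for hit_rate in (0.90, 0.98, 0.99):
    misses = requests_per_second * (1 - hit_rate)
    print(f"hit rate {hit_rate:.0%}: {misses:,.0f} backend queries/sec")

# hit rate 90%: 100,000 backend queries/sec
# hit rate 98%: 20,000 backend queries/sec
# hit rate 99%: 10,000 backend queries/sec
```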
Distributed caches accept bounded staleness in exchange for throughput and availability. Data freshness is controlled through time-to-live (TTL) expiration, explicit invalidation when data changes, or write propagation strategies like cache-aside, read-through, write-through, and write-back. Most large production systems prefer eventual consistency models with short TTLs rather than strong consistency, because strict coherence would require synchronous invalidation across all cache nodes and backend systems, adding latency and reducing availability during network partitions or node failures.
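As an illustration of the most common of these patterns, here is a minimal cache-aside sketch with TTL-based expiration. It assumes the redis-py client, a Redis-compatible cache at a hypothetical host, and a hypothetical query_database helper standing in for the backend.

```python
import json
import redis  # assumes redis-py and a reachable Redis-compatible cache

cache = redis.Redis(host="cache.internal", port=6379)
PROFILE_TTL_SECONDS = 60  # bounded staleness window

def query_database(user_id: str) -> dict:
    # Hypothetical stand-in for the real 10-100 ms backend query.
    return {"user_id": user_id}

def get_user_profile(user_id: str) -> dict:
    key = f"user:{user_id}:profile"

    cached = cache.get(key)            # fast path: sub-millisecond memory lookup
    if cached is not None:
        return json.loads(cached)

    profile = query_database(user_id)  # slow path: go to the database on a miss
    cache.set(key, json.dumps(profile), ex=PROFILE_TTL_SECONDS)
    return profile

def update_user_profile(user_id: str, profile: dict) -> None:
    ...  # write the new profile to the database first
    # Invalidate so the next read repopulates the cache with fresh data.
    cache.delete(f"user:{user_id}:profile")
```

Even without explicit invalidation, the TTL caps how long a stale value can be served, which is exactly the bounded-staleness trade described above.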
💡 Key Takeaways
•In-memory cache gets complete in under 1 millisecond at p50 and in low single-digit milliseconds at p99 within a region, while database queries take 10 to 100 milliseconds, making cache hits roughly 10 to 100 times faster than backend lookups.
•Hit ratio is the critical metric: increasing from 90% to 99% reduces backend load by a factor of 10. At Meta scale, a 2-point drop in hit rate can add hundreds of thousands of queries per second to the database layer.
•Horizontal scaling works by partitioning keys across nodes using consistent hashing with virtual nodes, so adding cache servers increases both total memory capacity and aggregate throughput linearly.
•Replication within the cache layer uses per-partition primaries with 1 to 2 replicas to provide high availability and enable read fan-out, so reads can be served from any replica while writes go to the primary (see the sketch after this list).
•Most systems trade strong consistency for performance and availability, using TTL-based expiration and eventual consistency models rather than synchronous invalidation, accepting staleness windows of seconds to minutes.
•Real production examples show massive scale: Netflix EVCache handles multi-million requests per second with p99 latency of 1 to 2 ms, AWS DAX delivers microsecond reads at millions of requests per second, and CDN edge caches with 90 to 99% hit rates cut origin load by 10 to 100 times.
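The read fan-out idea from the replication takeaway can be illustrated with a toy partition object. The node names and the random replica choice are purely illustrative; real clients typically also weigh replica health and locality.

```python
import random

class Partition:
    """One cache partition: writes go to the primary, reads fan out to replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self.replicas = replicas

    def node_for_write(self) -> str:
        # All writes funnel through the primary so replicas stay consistent.
        return self.primary

    def node_for_read(self) -> str:
        # Any replica (or the primary) can serve a read, spreading load
        # and tolerating the loss of a single node.
        return random.choice([self.primary] + self.replicas)

partition = Partition(primary="cache-1a", replicas=["cache-1b", "cache-1c"])
print(partition.node_for_write())  # always cache-1a
print(partition.node_for_read())   # cache-1a, cache-1b, or cache-1c
```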
📌 Examples
Meta (Facebook) uses a massive memcache tier fronting sharded MySQL for the TAO social graph system. Regional cache pools serve millions of requests per second with sub-millisecond median latency. Techniques include client-side sharding to avoid proxy hops, region locality to avoid cross-datacenter network costs, and lease-based gets to prevent thundering herds when popular keys expire.
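The lease idea can be sketched as follows. This is a simplified single-process toy (the real memcached lease protocol hands out a token that is validated when the value is written back), but it shows how only one caller is allowed to refill an expired hot key while the rest back off instead of stampeding the database.

```python
import threading
import time

class LeasingCache:
    """Toy lease-based get: on a miss, one caller wins the lease and refills
    the key; concurrent callers are told to wait rather than query the DB."""

    def __init__(self):
        self._data = {}
        self._leases = {}   # key -> expiry time of the outstanding lease
        self._lock = threading.Lock()

    def get(self, key, lease_ttl=1.0):
        with self._lock:
            if key in self._data:
                return "hit", self._data[key]
            now = time.monotonic()
            if now >= self._leases.get(key, 0.0):
                # No live lease: this caller earns the right to refill the key.
                self._leases[key] = now + lease_ttl
                return "lease", None
            # Another caller holds the lease: back off and retry shortly.
            return "wait", None

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            self._leases.pop(key, None)

cache = LeasingCache()
print(cache.get("popular:key"))   # ("lease", None) - this caller refills it
print(cache.get("popular:key"))   # ("wait", None)  - concurrent caller backs off
cache.set("popular:key", "fresh value")
print(cache.get("popular:key"))   # ("hit", "fresh value")
```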
Netflix EVCache is a memcache-based system for personalization and metadata. Data is replicated across 3 availability zones for resilience, but clients prefer same-AZ reads to keep tail latency low and avoid cross-AZ transfer costs. Short TTLs of seconds to minutes favor freshness for rapidly changing recommendation data.
Amazon DynamoDB Accelerator (DAX) is a write-through and read-through cache that integrates tightly with DynamoDB, providing microsecond reads compared to typical 1 to 5 ms DynamoDB response times. It maintains DynamoDB's eventually consistent read semantics at the API level while delivering up to a 10 times performance improvement for read-heavy workloads.
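The read-through and write-through semantics described here can be sketched as a thin wrapper around a store. This shows the general pattern, not DAX's internals; the backing_store interface (get/put) is a hypothetical stand-in for the database.

```python
class ReadWriteThroughCache:
    """Sketch of read-through / write-through caching over a key-value store."""

    def __init__(self, backing_store):
        self._store = backing_store   # must expose get(key) and put(key, value)
        self._cache = {}

    def get(self, key):
        # Read-through: the cache itself loads from the store on a miss,
        # so callers never talk to the store directly.
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]

    def put(self, key, value):
        # Write-through: the write goes to the store and the cache in one
        # call path, so later reads see the new value without invalidation.
        self._store.put(key, value)
        self._cache[key] = value

class DictStore:
    """In-memory stand-in for the backing database, for demonstration only."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def put(self, key, value):
        self._d[key] = value

dax_like = ReadWriteThroughCache(DictStore())
dax_like.put("order:1", {"status": "shipped"})
print(dax_like.get("order:1"))  # served from cache after the write-through
```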