TSDB Sharding and Hot Partition Problem
Distributing a Time Series Database (TSDB) across multiple nodes requires a careful sharding strategy to avoid creating hot partitions that become performance bottlenecks. The naive approach of sharding only by time creates a fundamental problem: all writes for the current time window collapse into a single "now" partition, overwhelming that shard while leaving historical shards idle.
The solution is compound sharding using both a time bucket and a series identifier. A series identifier is typically a hash of the measurement name plus all tag key-value pairs. For example, a metric "cpu.usage,host=server42,region=us-west" gets hashed to produce a stable identifier, then routed to a shard determined by (time_bucket, series_hash mod shard_count). This spreads the "now" traffic across all shards while maintaining time locality for efficient range scans. However, it introduces a second failure mode: skewed tag distributions.
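A minimal sketch of this routing scheme, assuming a hypothetical 16-shard cluster, 2-hour time buckets, and SHA-1 as the stable hash; the shard count, bucket width, and hash choice are illustrative assumptions, not any particular TSDB's defaults:

```python
import hashlib

SHARD_COUNT = 16            # hypothetical cluster size
TIME_BUCKET_SECONDS = 7200  # assumed 2-hour time buckets

def series_id(measurement: str, tags: dict[str, str]) -> int:
    """Stable series identifier: hash of measurement plus sorted tag key-value pairs."""
    canonical = measurement + "," + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items())
    )
    # Use a stable hash (not Python's randomized hash()) so routing survives restarts.
    return int.from_bytes(hashlib.sha1(canonical.encode()).digest()[:8], "big")

def route(measurement: str, tags: dict[str, str], timestamp: int) -> tuple[int, int]:
    """Route a sample to (time_bucket, shard) via compound sharding."""
    time_bucket = timestamp // TIME_BUCKET_SECONDS
    shard = series_id(measurement, tags) % SHARD_COUNT
    return time_bucket, shard

# Samples for the same series always land on the same shard within a bucket,
# while different series written "now" spread across all shards.
print(route("cpu.usage", {"host": "server42", "region": "us-west"}, 1_700_000_000))
```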
When a single tag value dominates traffic (a celebrity user, a popular customer, a major service), it creates load imbalance even with series hashing. For instance, if 80 percent of your metrics come from one customer, their series identifiers hash fairly uniformly, but each high-volume series still lands entirely on one shard per time bucket, so the shards holding those series saturate. In practice, p99 latency can jump from 10ms to 500ms when hot partitions saturate. Mitigation strategies include dynamic rebalancing based on observed load, ingestion-side load shedding under duress (dropping low-priority samples; a sketch follows below), and pre-aggregation at collection time to reduce cardinality before storage.
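A minimal sketch of ingestion-side load shedding under a per-shard write budget; the one-second window, the 50,000 writes/second limit, and the two-level priority scheme are illustrative assumptions rather than any specific system's policy:

```python
import time
from collections import defaultdict

# Hypothetical per-shard write budget; real systems derive this from observed capacity.
WRITES_PER_SECOND_LIMIT = 50_000

class ShardLoadShedder:
    """Sheds low-priority samples once a shard exceeds its per-second write budget."""

    def __init__(self, limit: int = WRITES_PER_SECOND_LIMIT):
        self.limit = limit
        self.window_start = defaultdict(float)  # shard -> start of current 1s window
        self.window_count = defaultdict(int)    # shard -> writes seen in that window

    def admit(self, shard: int, priority: int) -> bool:
        """Return True if the sample should be written. Low-priority (0) samples are
        dropped once the shard's current one-second window is over budget."""
        now = time.monotonic()
        if now - self.window_start[shard] >= 1.0:
            self.window_start[shard] = now
            self.window_count[shard] = 0
        self.window_count[shard] += 1
        if self.window_count[shard] > self.limit and priority == 0:
            return False  # shed: this shard is saturated and the sample is low priority
        return True

# Usage: drop debug-level samples for a hot shard, keep everything else.
shedder = ShardLoadShedder()
if shedder.admit(shard=3, priority=0):
    pass  # write the sample
```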
Google's Monarch system addresses this at planet scale through hierarchical aggregation and placement strategies that consider both geographic distribution and tenant isolation. Uber's M3 uses a sharded topology with replication across zones, providing both high availability and capacity isolation per tenant to prevent noisy neighbor problems where one tenant's burst starves others.
💡 Key Takeaways
•Time-only sharding creates a hot "now" partition: all current writes collapse into a single shard while historical shards sit idle
•Compound sharding uses (time_bucket, series_hash mod shard_count) to distribute current traffic across all shards while maintaining time locality
•The series identifier is a hash of the measurement plus all tag key-value pairs, providing stable routing: cpu.usage,host=server42,region=us-west always hashes to the same value
•Skewed tag distributions create hot partitions even with series hashing when a single customer or service dominates traffic volume, causing p99 latency to spike from 10ms to 500ms
•Mitigation strategies: dynamic rebalancing based on observed load, ingestion-side load shedding for low-priority samples, pre-aggregation to reduce cardinality
•Multi-tenant isolation is critical to prevent noisy neighbors: per-tenant quotas, circuit breakers, and resource isolation at the shard and placement level (see the sketch after this list)
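A minimal sketch of a per-tenant quota with a simple circuit breaker, assuming a fixed accounting window reset by an external timer; the quota sizes, the two-threshold design, and the reset cadence are illustrative assumptions:

```python
class TenantQuota:
    """Per-tenant sample quota with a circuit breaker to contain noisy neighbors."""

    def __init__(self, samples_per_window: int, breaker_threshold: int):
        self.samples_per_window = samples_per_window  # steady-state quota
        self.breaker_threshold = breaker_threshold    # hard cutoff that trips the breaker
        self.used = 0
        self.tripped = False

    def try_consume(self, n: int = 1) -> bool:
        """Return True if the tenant may write n more samples in this window."""
        if self.tripped:
            return False          # breaker open: reject until the window resets
        self.used += n
        if self.used > self.breaker_threshold:
            self.tripped = True   # sustained overage: stop accepting this tenant's writes
            return False
        return self.used <= self.samples_per_window

    def reset_window(self) -> None:
        """Called on a timer (e.g. every minute) to start a fresh accounting window."""
        self.used = 0
        self.tripped = False

# Usage: a bursting tenant is throttled, then cut off, without starving other tenants.
quota = TenantQuota(samples_per_window=1_000_000, breaker_threshold=5_000_000)
accepted = quota.try_consume(10_000)
```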
📌 Examples
Uber M3: sharded topology with replication across zones providing high availability and capacity isolation per tenant to handle tens of millions of samples per second
Google Monarch: planet scale globally replicated system using hierarchical aggregation and placement strategies considering geographic distribution and tenant isolation
Hot partition scenario: a single customer generating 80% of metric volume causes a subset of shards to saturate while others remain underutilized; p99 query latency degrades from 10ms to 500ms