
TSDB Cardinality Explosion and Tag Management

Cardinality explosion is the most common production failure mode in time-series databases (TSDBs). It occurs when high-dimensional tagging creates an uncontrolled multiplication of unique series: each unique combination of measurement plus tag key-value pairs becomes a distinct series that consumes memory for indexing and buffering. Systems that start with manageable cardinality (thousands of series) can suddenly explode to millions when engineers add tags like user ID, request ID, or session ID without understanding the multiplicative effect.

The memory impact is severe because TSDBs maintain in-memory structures for every active series: hash maps for series lookup, inverted indexes for tag filtering, and reorder buffers for out-of-order writes. A system handling 100,000 series might use 2 to 4 GB of memory, but growing to 10 million series can require 200 to 400 GB, causing out-of-memory (OOM) kills during compaction when the process needs additional heap space. Query performance degrades simultaneously because tag filters must scan larger indexes and group-by operations produce massive result sets.

Production systems implement strict cardinality controls. Prometheus warns when cardinality crosses thresholds, and some deployments enforce hard limits that reject new series. The guiding principle is to use tags only for low-cardinality dimensions (environment with 3 values, region with 10, service with 100 gives 3,000 combinations) and to avoid high-cardinality identifiers. For per-user metrics that would otherwise create millions of series, the pattern is pre-aggregation at collection time (count users by cohort before storing) or separate systems (high-cardinality logs in search engines, low-cardinality metrics in the TSDB).

Hash bucketing is a mitigation technique in which high-cardinality values are hashed into a fixed number of buckets. Instead of using user_id as a tag with millions of values, compute user_bucket as hash(user_id) mod 100, creating only 100 series. This loses per-user detail but enables aggregate analysis, such as comparing buckets to detect anomalies, while keeping cardinality bounded.

Netflix and Uber engineering presentations emphasize schema governance and cardinality reviews before new tag keys are allowed into production systems.
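To make the multiplicative arithmetic and memory estimates above concrete, here is a minimal back-of-the-envelope sketch in Python. The per-series byte figure is an assumption derived from the 2 to 4 GB per 100,000 series range quoted above; real overhead varies by TSDB, tag sizes, and buffer configuration:

```python
# Back-of-the-envelope cardinality and memory math.
# PER_SERIES_BYTES is an illustrative assumption (~20-40 KB per active
# series, per the figures in the text); real overhead varies by system.
PER_SERIES_BYTES = 30_000

def cardinality(tag_value_counts: dict) -> int:
    """Each tag key multiplies the series count by its number of values."""
    total = 1
    for count in tag_value_counts.values():
        total *= count
    return total

safe_tags = {"environment": 3, "region": 10, "service": 100}
base = cardinality(safe_tags)        # 3 * 10 * 100 = 3,000 series
exploded = base * 1_000_000          # adding user_id: 3,000,000,000 series

print(f"safe schema:  {base:,} series ~ {base * PER_SERIES_BYTES / 1e9:.2f} GB")
print(f"with user_id: {exploded:,} series ~ {exploded * PER_SERIES_BYTES / 1e12:.0f} TB")
```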
💡 Key Takeaways
Cardinality is the count of unique series; each combination of measurement plus all tag key-value pairs creates a distinct series that consumes memory and index space
Memory impact: 100,000 series uses roughly 2 to 4 GB, but growing to 10 million series requires 200 to 400 GB, causing OOM kills during compaction when additional heap is needed
Multiplicative explosion: environment (3 values) × region (10) × service (100) creates 3,000 series, but adding user_id with 1 million values explodes that to 3 billion series
Symptoms include query timeouts (tag filters scan massive indexes), OOM kills during compaction (the process needs additional heap), and ingestion backpressure (the system cannot keep up with new series creation)
Safe pattern: use tags for low-cardinality dimensions (environment, region, service) and avoid high-cardinality identifiers (user_id, request_id, session_id)
Hash bucketing mitigation: compute user_bucket as hash(user_id) mod 100 to create only 100 series instead of millions; this loses per-user detail but enables aggregate anomaly detection while keeping cardinality bounded (see the sketch below)
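A minimal hash-bucketing sketch in Python; the metrics emit call at the end is a hypothetical placeholder for whatever client library is in use. One design note: a stable hash matters here, because Python's built-in hash() is salted per process and would assign the same user to different buckets across restarts:

```python
import hashlib

NUM_BUCKETS = 100  # hard upper bound on cardinality for this dimension

def user_bucket(user_id: str, buckets: int = NUM_BUCKETS) -> int:
    """Map a high-cardinality identifier into one of a fixed set of buckets."""
    # hashlib gives a stable digest across processes and restarts;
    # the built-in hash() is randomized per process and is not suitable.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

# Hypothetical emit call: tag with the bucket, never the raw identifier.
# metrics.emit("api_latency_ms", latency, tags={"user_bucket": str(user_bucket(uid))})
```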
📌 Examples
Production failure: adding a request_id tag to track individual requests across millions of users causes cardinality to explode from thousands to billions of series; the system OOMs within hours
Prometheus cardinality management: warns when the series count crosses thresholds; some deployments enforce hard limits that reject new series to prevent runaway growth
Netflix and Uber approach: schema governance with cardinality reviews required before new tag keys are allowed into production, plus pre-aggregation at collection time for high-cardinality dimensions
Hash bucketing example: instead of tagging with customer_id (millions of values), use customer_bucket = hash(customer_id) mod 1000 to get 1,000 series while still detecting which buckets show anomalies, as sketched below
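One way the "compare buckets to detect anomalies" step might look, sketched as a simple z-score over per-bucket error rates. The threshold and the choice of aggregation are assumptions for illustration, not a prescribed method:

```python
from statistics import mean, stdev

def anomalous_buckets(rate_by_bucket: dict, z_threshold: float = 3.0) -> list:
    """Flag buckets whose error rate deviates sharply from the rest of the fleet."""
    rates = list(rate_by_bucket.values())
    mu = mean(rates)
    sigma = stdev(rates)
    if sigma == 0:
        return []
    return [b for b, r in rate_by_bucket.items() if abs(r - mu) / sigma > z_threshold]

# e.g. anomalous_buckets({0: 0.010, 1: 0.012, 2: 0.250, 3: 0.011}) -> [2]
```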