Cardinality Explosion: The Silent Killer of Time Series Systems
What Is Cardinality:
Cardinality in time series systems refers to the number of unique combinations of measurement name and tag values, which equals the number of distinct time series you track. A metric like http_requests with tags for service (10 values), region (5 values), and status_code (10 values) yields 10 times 5 times 10 equals 500 unique series. Each series needs its own index entry, metadata, and storage structures. When cardinality explodes from thousands to millions, memory usage spikes, ingestion slows, and queries fail.
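As a quick sanity check, worst-case cardinality is just the product of the per-tag value counts; a minimal sketch using the hypothetical counts from above:

```python
from math import prod

# Hypothetical tag dimensions for an http_requests metric.
tag_value_counts = {"service": 10, "region": 5, "status_code": 10}

# Every combination of tag values is a distinct time series,
# so worst-case cardinality is the product of per-tag counts.
series_count = prod(tag_value_counts.values())
print(series_count)  # 500
```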
The Explosion Scenario:
Consider a team that instruments a high volume API with a metric tagged by endpoint, method, and region, then decides to add user_id as a tag to debug a specific customer issue. If you have 1 million active users hitting 100 endpoints across 5 regions, you suddenly have 1,000,000 times 100 times 5 equals 500 million unique series. If each series consumes 1 kilobyte of index memory, you need 500 gigabytes just for the index. Systems designed for 100 thousand to 1 million series now face 500 million.
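A back-of-the-envelope sketch of that estimate (the 1 kilobyte per series figure is the assumption from the text, not a measured value):

```python
# users x endpoints x regions, per the scenario above
series = 1_000_000 * 100 * 5                 # 500,000,000 unique series
index_gb = series * 1_000 / 1e9              # ~1 KB of index memory per series
print(f"{series:,} series, ~{index_gb:.0f} GB of index memory")
# 500,000,000 series, ~500 GB of index memory
```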
Datadog and InfluxData both emphasize cardinality controls because this failure mode is common and catastrophic. At Datadog scale, unbounded cardinality can cause ingestion queues to back up, coordinator nodes to run out of memory, and query performance to degrade from milliseconds to timeouts. In severe cases, the system rejects new series or drops data to avoid total failure.
❗ Remember: High cardinality tags like user identifiers, request identifiers, or session tokens should live in logs or traces, not metric tags. Metrics aggregate across many events; logs capture individual events with arbitrary detail.
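A minimal sketch of that split, where emit_metric is a hypothetical stand-in for a real metrics client (statsd, OpenTelemetry, and similar): the counter carries only bounded tags, while the unbounded user_id travels in the log record.

```python
import logging

logger = logging.getLogger("checkout")

def emit_metric(name: str, tags: dict) -> None:
    """Hypothetical metrics client; a real one would ship tagged counters."""
    ...

def record_request(user_id: str, status: int) -> None:
    # Metric: bounded tags only -- aggregates across all users.
    emit_metric("http_requests", tags={"service": "checkout", "status": str(status)})
    # Log: arbitrary per-event detail, including the unbounded user_id.
    logger.info("request handled", extra={"user_id": user_id, "status": status})
```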
Strategies to Control Cardinality:
First, choose tag dimensions carefully. Good tags are low cardinality and stable over time: environment (production, staging), region (us_west, eu_central), service name, host or pod identifier if bounded by fleet size. Bad tags are unbounded: user_id, trace_id, timestamp strings, dynamically generated identifiers.
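One way to make this rule enforceable rather than advisory is an allowlist check at the instrumentation boundary; a minimal sketch, with the allowlist contents assumed for illustration:

```python
# Low-cardinality, stable dimensions only (illustrative allowlist).
ALLOWED_TAGS = {"environment", "region", "service", "host"}

def validate_tags(tags: dict) -> dict:
    """Reject any tag key not on the low-cardinality allowlist."""
    unknown = set(tags) - ALLOWED_TAGS
    if unknown:
        raise ValueError(f"disallowed tag keys (cardinality risk): {unknown}")
    return tags
```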
Second, enforce limits at ingestion time. Many systems set a maximum like 1 million active series per metric per tenant. When a new series arrives that would exceed the limit, the system either rejects it with a clear error or applies sampling. This prevents runaway cardinality from one misconfigured service.
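A sketch of that ingestion-time check, assuming the set of active series keys per tenant and metric is tracked in memory (a real system would consult its index rather than a plain dict):

```python
from collections import defaultdict

MAX_SERIES_PER_METRIC = 1_000_000

# (tenant, metric) -> set of series keys seen so far
active_series: dict[tuple[str, str], set] = defaultdict(set)

def admit(tenant: str, metric: str, tags: dict) -> bool:
    series_key = tuple(sorted(tags.items()))
    known = active_series[(tenant, metric)]
    if series_key in known:
        return True   # existing series: always accepted
    if len(known) >= MAX_SERIES_PER_METRIC:
        return False  # new series over the limit: reject with a clear error upstream
    known.add(series_key)
    return True
```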
Third, monitor cardinality actively. Track the number of unique series per metric and alert when growth rate exceeds expected patterns. A sudden 10 times jump in cardinality often indicates a bug in instrumentation, such as accidentally including a unique identifier in a tag.
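Monitoring can be as simple as comparing the current per-metric series count against a recent baseline and alerting on a sudden multiple; the 10 times threshold below mirrors the pattern described above and is illustrative, not a standard:

```python
GROWTH_ALERT_FACTOR = 10  # alert on a 10x jump in unique series

def check_cardinality(metric: str, current: int, baseline: int) -> None:
    if baseline and current / baseline >= GROWTH_ALERT_FACTOR:
        print(f"ALERT: {metric} series count jumped {current / baseline:.0f}x "
              f"({baseline:,} -> {current:,}); check for unbounded tags")
```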
The Trade-Off:
Limiting cardinality means losing granularity. If you cannot tag by user_id, you cannot directly query metrics for a single user without correlating with logs or traces. The design assumes you aggregate across many users or requests, using metrics for trends and alerts, while drilling into individual cases with logs. This separation of concerns is central to observability architecture but requires discipline.
Real World Impact:
At a company running a large scale time series system, a single misconfigured microservice introduced a tag with request_id, generating 50 million new series in 10 minutes. The ingestion cluster memory utilization spiked from 60 percent to 95 percent, causing garbage collection pauses and write timeouts. Dashboards querying unrelated metrics slowed to 10 to 20 second load times due to index contention. The incident was resolved by blocking the offending metric at the ingestion layer and educating the team on cardinality best practices.
💡 Key Takeaways
• Cardinality equals unique combinations of metric name and tag values; adding a tag with 1 million values to a metric with 500 series explodes it to 500 million series.
• Each series consumes index memory (typically 1 kilobyte or more), so 500 million series requires 500 gigabytes of memory, causing out of memory errors and ingestion failures.
• Good tags are low cardinality and stable (service, region, environment); bad tags are unbounded (user_id, request_id, trace_id, dynamically generated strings).
• Systems enforce limits like 1 million active series per metric per tenant, rejecting or sampling new series beyond the limit to prevent runaway cardinality.
• Cardinality explosion incidents at scale cause memory spikes from 60 percent to 95 percent utilization, garbage collection pauses, write timeouts, and dashboard queries degrading from milliseconds to 10+ seconds.
📌 Examples
A microservice adds request_id tag to http_requests metric, generating 50 million new series in 10 minutes, spiking memory from 60 percent to 95 percent and causing write timeouts.
Datadog limits customers to around 1 million custom metrics per account by default, enforcing this at ingestion to avoid cardinality explosions that could destabilize the platform.
A safe metric design: http_requests with tags {service: checkout, region: us_east, status: 200, method: POST} yields 10 services times 5 regions times 10 statuses times 5 methods equals 2500 series, well within limits.