High Cardinality Aggregations and Memory Safety
Aggregations in search systems (faceting, distinct counts, percentiles) execute per shard, and the partial results are then reduced on the coordinator node. High cardinality operations, however, can easily exhaust heap memory and crash nodes: a terms aggregation over a field with millions of unique values must track a count for each term in memory, and without safeguards a single malicious or poorly designed query can allocate gigabytes of heap.
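To make the failure mode concrete, here is a minimal sketch using the elasticsearch-py client (the index and field names are hypothetical, and the older `body=` calling style is used for brevity). The key point is that each shard's bucket map grows with the field's cardinality, not with the number of buckets requested:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A terms aggregation over a high-cardinality field: every shard must hold
# a bucket (term -> count) for each distinct user_id it sees before the
# coordinator reduces the per-shard results.
risky_query = {
    "size": 0,  # aggregation only, no document hits
    "aggs": {
        "users": {
            "terms": {
                "field": "user_id",  # assume millions of distinct values
                "size": 10000,       # caps buckets returned, not heap used
            }
        }
    },
}

# Heap allocation on each shard scales with cardinality, not with "size".
resp = es.search(index="events", body=risky_query)
```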
Probabilistic data structures solve this by trading perfect accuracy for bounded memory. HyperLogLog++ (HLL++) provides distinct counts with approximately 1 to 2% error using only a few kilobytes regardless of cardinality: counting 1 million unique user IDs uses the same 16 KB as counting 1 billion. T-digest and Q-digest compute percentiles over streaming data with controlled memory (typically 1 to 10 KB) instead of storing all values.
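Elasticsearch exposes both structures directly: the cardinality aggregation is backed by HLL++ (tunable via `precision_threshold`) and the percentiles aggregation uses t-digest by default. A sketch of both, with illustrative field names:

```python
# Approximate alternatives with bounded memory, sent via the same
# es.search(...) call as above.
approx_query = {
    "size": 0,
    "aggs": {
        "unique_users": {
            "cardinality": {
                "field": "user_id",
                # HLL++ knob: counts below this threshold are near-exact;
                # higher values trade memory for accuracy (max 40000)
                "precision_threshold": 3000,
            }
        },
        "latency_pcts": {
            "percentiles": {
                "field": "latency_ms",
                "percents": [50, 95, 99],
                # t-digest knob: higher compression = more accuracy,
                # more (but still bounded) memory
                "tdigest": {"compression": 100},
            }
        },
    },
}
```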
Doc values enable efficient aggregations by storing field values in columnar format on disk rather than loading entire documents into heap. Text fields, however, do not support doc values, so aggregating on them requires enabling fielddata, which builds the structure in heap and can consume gigabytes per shard. Always aggregate on keyword fields backed by doc values, and enforce cardinality limits (such as a maximum of 10,000 unique terms per aggregation) through query validation.
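A sketch of both practices, assuming a hypothetical `repo_name` field and the 10,000-term limit mentioned above: map the analyzed text field with a keyword sub-field for aggregations, and check a field's approximate cardinality before running the expensive terms aggregation.

```python
# Mapping sketch: aggregate on the keyword sub-field backed by doc values,
# and leave fielddata disabled on the analyzed text field (the default).
mapping = {
    "mappings": {
        "properties": {
            "repo_name": {
                "type": "text",             # for full-text search
                "fields": {
                    "raw": {
                        "type": "keyword",  # for aggregations and sorting
                        "doc_values": True, # columnar on-disk (the default)
                    }
                },
            }
        }
    }
}

MAX_AGG_CARDINALITY = 10_000  # hypothetical per-query limit from the text

def validate_terms_agg(es, index: str, field: str) -> None:
    """Reject a terms aggregation up front if the field's approximate
    cardinality exceeds the enforced limit (a sketch of query validation)."""
    resp = es.search(index=index, body={
        "size": 0,
        "aggs": {"card": {"cardinality": {"field": field}}},
    })
    if resp["aggregations"]["card"]["value"] > MAX_AGG_CARDINALITY:
        raise ValueError(f"{field} exceeds {MAX_AGG_CARDINALITY} unique terms")
```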
GitHub's Elasticsearch deployment encountered memory pressure from high cardinality aggregations over repository metadata, where unbounded queries could trigger heap exhaustion on nodes. Circuit breakers provide last-resort protection by rejecting queries when heap usage crosses thresholds (typically 70 to 85% of heap), but the better approach is prevention: enforced cardinality limits, approximate aggregations for high cardinality fields, and query timeouts. Uber applies per-tenant aggregation limits and uses HLL++ for unique driver and restaurant counts across cities to maintain stability at global scale.
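A defense-in-depth sketch using real Elasticsearch cluster settings (the specific values here are illustrative, not recommendations): tighten the request and fielddata circuit breakers, cap the total bucket count, and attach a per-query timeout.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Last-resort protection: dynamic cluster settings for circuit breakers
# and the aggregation bucket cap.
es.cluster.put_settings(body={
    "persistent": {
        "indices.breaker.request.limit": "60%",    # per-request heap cap
        "indices.breaker.fielddata.limit": "40%",  # fielddata heap cap
        "search.max_buckets": 65536,               # total agg bucket cap
    }
})

# Prevention: bound the work per query as well.
guarded_query = {
    "size": 0,
    "timeout": "10s",  # abort slow aggregations instead of letting them pile up
    "aggs": {"cities": {"terms": {"field": "city_id", "size": 500}}},
}
```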
💡 Key Takeaways
• High cardinality aggregations (millions of unique values) can exhaust heap memory by tracking counts per term, causing node crashes without safeguards
• HyperLogLog++ (HLL++) provides distinct counts with 1 to 2% error using fixed 16 KB memory whether counting 1 million or 1 billion unique values
• Doc values store field values in columnar disk format for efficient aggregations, while fielddata loads text fields into heap and can consume gigabytes per shard
• Circuit breakers reject queries when heap crosses thresholds (70 to 85%) but prevention through cardinality limits and approximate aggregations is better
• T-digest and Q-digest compute percentiles over streaming data with 1 to 10 KB memory instead of storing all values in heap
• GitHub experienced memory pressure from unbounded aggregations over repository metadata, requiring enforced limits and approximate structures for stability
📌 Examples
Distinct user count: An exact aggregation over 10 million users requires hundreds of MB of heap, whereas HLL++ uses 16 KB with roughly 1.5% error
Uber city analytics: Uses HLL++ for unique driver and restaurant counts across millions of entities per city, maintaining bounded memory at global scale
Circuit breaker protection: Query attempts to aggregate 5 million unique terms, heap usage hits 80% threshold, breaker rejects query before crash