Failure Modes and Edge Cases in Production
When Things Go Wrong:
In interviews, demonstrating awareness of failure modes shows production experience. Let's cover where Druid breaks and how to mitigate each failure.
Ingestion Backpressure Crisis:
Your Kafka topic normally sees 1 million events per second. Suddenly it spikes to 10 million due to a traffic surge or upstream bug emitting duplicate events. Your 20 indexing tasks can't keep up. What happens?
First, data freshness suffers. Queries that should return events from the last 5 minutes now see nothing newer than 10 minutes old because recent segments haven't been built yet. The gap between event time and queryable time grows from 30 seconds to 5 minutes or more.
Second, indexing tasks start running out of memory as they buffer more events before persisting segments. Tasks crash and restart, falling further behind. In severe cases this becomes a death spiral: tasks repeatedly crash with out-of-memory (OOM) errors, Kafka consumer lag grows to millions of messages, and recovery takes hours.
❗ Remember: Monitor Kafka consumer lag as a leading indicator. Alert when lag exceeds 1 minute of data. Mitigations: auto-scale indexing tasks based on lag, apply backpressure to producers, or temporarily increase the segment persist frequency to reduce memory pressure.
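To make that monitoring concrete, here is a minimal sketch of a lag check built on the Kafka supervisor's status API. The Overlord URL, supervisor id, and normal ingest rate are placeholders, and the aggregateLag field name should be verified against your Druid version.

```python
# Minimal lag watch (sketch). Assumptions: an Overlord reachable at OVERLORD, a Kafka
# supervisor named SUPERVISOR_ID, and a status payload exposing "aggregateLag"
# (total messages behind across partitions); field names can vary by Druid version.
import requests

OVERLORD = "http://overlord:8090"        # placeholder
SUPERVISOR_ID = "clickstream"            # placeholder
EVENTS_PER_SECOND = 1_000_000            # normal ingest rate, to convert lag into seconds
LAG_ALERT_SECONDS = 60                   # alert once lag exceeds ~1 minute of data

def consumer_lag_seconds() -> float:
    url = f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status"
    status = requests.get(url, timeout=10).json()
    lag_messages = status["payload"]["aggregateLag"]
    return lag_messages / EVENTS_PER_SECOND

if __name__ == "__main__":
    lag = consumer_lag_seconds()
    if lag > LAG_ALERT_SECONDS:
        print(f"ALERT: ingestion is ~{lag:.0f}s behind; consider raising taskCount")
```

Scaling out is then a matter of resubmitting the supervisor spec with a higher taskCount to the Overlord's supervisor endpoint, which your automation can trigger once the alert fires.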
High Cardinality Dimension Explosion:
You decide to add user_id as a dimension on a clickstream dataset with 50 million daily active users. Each user generates 100 events per day. With rollup, you expect compression, right?
Wrong. Rollup only works when multiple events share the same dimension combination. With user_id included, almost no events have identical dimensions (same user, same timestamp bucket, same country, same device). Your data goes from 1 million rolled-up rows per hour to 500 million raw rows per hour. Segment sizes explode from 200 MB to 50 GB compressed.
[Figure: Impact of a high-cardinality dimension: segment size of 200 MB with rollup vs. 50 GB with no rollup]
Query performance degrades severely. Dictionary encoding for 50 million distinct user_id values per day consumes gigabytes of memory. Bitmap indexes become massive. Simple aggregations that took 80 milliseconds now take 5 seconds.
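A toy calculation makes the rollup collapse concrete. This is plain Python over synthetic events, not Druid code; the event counts and dimension values are invented for illustration.

```python
# Rollup ratio = raw events / distinct (time bucket + dimensions) combinations.
import random

random.seed(0)
events = [
    {
        "minute": random.randrange(60),                  # one hour at minute granularity
        "country": random.choice(["US", "DE", "IN"]),
        "device": random.choice(["ios", "android", "web"]),
        "user_id": random.randrange(50_000_000),         # drawn from ~50M possible users
    }
    for _ in range(200_000)
]

def rollup_ratio(dims):
    combos = {tuple(event[d] for d in dims) for event in events}
    return len(events) / len(combos)

print(rollup_ratio(["minute", "country", "device"]))             # hundreds of events per stored row
print(rollup_ratio(["minute", "country", "device", "user_id"]))  # ~1.0: rollup is defeated
```

With only minute, country, and device there are at most 540 distinct combinations, so hundreds of raw events collapse into each stored row; add user_id and nearly every event becomes its own row.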
Solution: don't model high-cardinality identifiers as dimensions. Store user_id as a metric (non-indexed column) for rare drill-downs, hash it to reduce cardinality (user_id modulo 1000), or offload per-user detail to a separate key-value store and join in the application layer.
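If you take the hashing route, the bucketing can happen in the ingestion pipeline before events reach Kafka. A minimal sketch, with a hypothetical bucket count of 1,000:

```python
import hashlib

NUM_BUCKETS = 1_000  # hypothetical; tune to the cardinality your rollup can absorb

def bucket_user(user_id: str) -> int:
    """Deterministically map a high-cardinality user_id to one of NUM_BUCKETS buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

event = {"user_id": "u-8675309", "country": "US", "device": "ios"}
event["user_bucket"] = bucket_user(event["user_id"])  # ingest user_bucket as the dimension
# Keep the raw user_id out of the dimension list: store it as a non-indexed column
# for rare drill-downs, or serve per-user detail from a separate key-value store.
```

A cryptographic hash keeps bucket assignments stable across processes and languages, unlike Python's built-in hash(), which is randomized per interpreter run.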
Deep Storage Availability Failure:
Your S3 bucket becomes temporarily unavailable due to AWS region issues. What's the blast radius?
Query serving continues normally for segments already cached on historical nodes. But new segment loads fail. If a historical node crashes during the outage, it can't reload its segments from S3, reducing cluster capacity. Coordinators can't assign new segments or rebalance existing ones. Your cluster is essentially frozen in its current state.
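One way to catch this early is to watch segment availability from the Coordinator. The sketch below polls the load-status API; the Coordinator URL is a placeholder, and the response shape (a map of datasource name to percent of segments loaded) should be confirmed against your Druid version.

```python
# Alert when any datasource has unavailable segments, e.g. during a deep storage outage.
import requests

COORDINATOR = "http://coordinator:8081"   # placeholder

def unavailable_datasources() -> dict:
    url = f"{COORDINATOR}/druid/coordinator/v1/loadstatus"
    loadstatus = requests.get(url, timeout=10).json()   # {"datasource": percent_loaded, ...}
    return {ds: pct for ds, pct in loadstatus.items() if pct < 100.0}

if __name__ == "__main__":
    missing = unavailable_datasources()
    if missing:
        print("Datasources with unavailable segments:", missing)
```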
Ingestion is affected differently. Real-time tasks can still ingest and build segments in memory, but they can't persist to S3, which limits how long they can run before exhausting memory. If tasks crash before S3 recovers, you lose whatever data they had buffered but not yet published, and that amount grows the longer the outage lasts.
⚠️ Common Pitfall: Don't treat Druid as your only copy of data. Keep the source Kafka topic with sufficient retention (7 to 14 days) so you can re-ingest if deep storage fails catastrophically. Test your restore procedures regularly.
Query Pathologies:
Certain query patterns cause predictable failures. An unbounded time-range query ("show me all clicks ever") forces the cluster to scan every segment of the datasource: hundreds of terabytes, potentially taking minutes or timing out. A high-cardinality GROUP BY on a poorly indexed dimension (GROUP BY session_id with 1 billion distinct sessions) causes memory exhaustion at the broker during the merge step.
Protection mechanisms: set query timeouts (a 30-second default is a reasonable starting point), enforce a maximum time range (for example, no queries spanning more than 90 days), cap query result sizes, and monitor broker heap usage. For multi-tenant environments, use query priority queues to isolate expensive queries from critical dashboards.
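A minimal sketch of such a guard, assuming it lives in a gateway service that already knows (or requires) the query's time bounds. The broker URL and function name are placeholders; "timeout" is the standard Druid query-context key, in milliseconds.

```python
# Query-gateway guard (sketch): reject over-broad time ranges and attach a timeout
# before forwarding SQL to the Broker's /druid/v2/sql endpoint.
from datetime import datetime, timedelta
import requests

BROKER_SQL = "http://broker:8082/druid/v2/sql"   # placeholder
MAX_RANGE = timedelta(days=90)
TIMEOUT_MS = 30_000

def run_guarded_query(sql: str, start: datetime, end: datetime):
    if end - start > MAX_RANGE:
        raise ValueError(f"time range {end - start} exceeds the {MAX_RANGE.days}-day limit")
    body = {
        "query": sql,
        "context": {"timeout": TIMEOUT_MS},   # Broker cancels the query past this budget
    }
    resp = requests.post(BROKER_SQL, json=body, timeout=TIMEOUT_MS / 1000 + 5)
    resp.raise_for_status()
    return resp.json()
```

Druid can enforce similar limits server-side, but a gateway lets you apply per-tenant policies without touching cluster configuration.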
💡 Key Takeaways
✓ Ingestion backpressure from 10x traffic spikes causes a growing gap between event time and queryable time, from 30 seconds to 5+ minutes, with potential task OOM crashes
✓ High-cardinality dimensions (50 million distinct user IDs) break rollup effectiveness, exploding segment sizes from 200 MB to 50 GB and degrading query latency from 80 milliseconds to 5 seconds
✓ Deep storage outages freeze cluster state: queries continue on cached segments, but no new segments load and ingestion risks data loss if tasks crash before persisting
✓ Unbounded time-range queries or high-cardinality GROUP BY operations cause broker memory exhaustion and timeouts, requiring query guards and resource limits
✓ Metadata store (PostgreSQL/MySQL) failure stops segment management but not query serving, making it critical to run with high-availability replicas
📌 Examples
1. Black Friday traffic spike: ingestion backpressure as traffic jumps from 2 million to 15 million events per second and Kafka lag grows to 10 minutes; mitigated by auto-scaling from 20 to 80 indexing tasks, which takes about 5 minutes, with a temporary loss of data freshness
2. Adding user_id as a dimension on a dataset of 5 billion events per day: the rollup ratio drops from 100:1 to 1:1, daily segment size grows from 50 GB to 5 TB, and query latency increases 50x; fixed by removing user_id as a dimension and storing it in a separate user-detail service
3. Query without a time filter on a 2-year, 300 TB dataset: it attempts to scan all 17,520 hourly segments and the broker times out after 2 minutes; prevented by enforcing a 90-day maximum time range in the query gateway