Real-time Analytics & OLAP › Apache Druid for Real-time Analytics (Medium · ~3 min)

Production Architecture and Data Flow

The Complete System: In production, Druid never runs alone. It sits between event producers and dashboards in a multi-component analytics stack, and understanding this end-to-end flow is critical for system design interviews.

Ingestion Pipeline: Applications and services emit events into Kafka or Kinesis, often millions per second across thousands of partitions; at Netflix scale, you might see 5 to 10 million clickstream events per second globally. Druid indexing tasks, running on MiddleManager nodes, subscribe to these streams, with each task responsible for a specific time range and set of partitions. As events arrive, the indexing task builds an in-memory incremental index for a time slice, typically a few minutes. During this window it applies rollup: if 1,000 events have identical dimension values (same timestamp bucket, country, device type), they're aggregated into a single row with summed metrics. This can reduce storage from gigabytes to megabytes for repetitive event patterns.
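Rollup can be sketched in a few lines of plain Python (an illustration of the idea, not Druid's actual implementation; the event shape and dimension names here are made up):

```python
from collections import defaultdict

def rollup(events, granularity_secs=60):
    """Collapse events with identical (time bucket, dimensions) into one
    aggregated row with summed metrics -- a toy model of Druid rollup."""
    rows = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ev in events:
        bucket = ev["ts"] - ev["ts"] % granularity_secs  # truncate timestamp
        key = (bucket, ev["country"], ev["device"])      # dimension values
        rows[key]["count"] += 1
        rows[key]["bytes"] += ev["bytes"]
    return dict(rows)

# 1,000 near-identical events within one minute collapse to a single row.
events = [{"ts": 1700000000 + i % 30, "country": "US", "device": "tv", "bytes": 10}
          for i in range(1000)]
aggregated = rollup(events)
print(len(aggregated))  # 1 row instead of 1,000
```

The compression ratio depends entirely on dimension cardinality: high-cardinality dimensions (user IDs, session IDs) defeat rollup because few rows share a key.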
1. Ingest from stream: Indexing task pulls events from Kafka, builds an in-memory columnar index with rollup, applies dictionary encoding and compression.
2. Persist segment: Every few minutes, the task writes an immutable segment file to deep storage (S3 or HDFS), then notifies the coordinator.
3. Load and serve: The coordinator assigns the segment to historical nodes based on load-balancing rules. Historicals download from S3 and cache on local SSD.
4. Query execution: Brokers route queries to the relevant historicals and real-time tasks, merge results, and return them to the dashboard with typical 100-millisecond latency.
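The query step is usually driven through Druid SQL over the broker's HTTP endpoint (POST to /druid/v2/sql with a JSON body). A minimal sketch, assuming a broker at localhost:8082 and a rolled-up datasource named clickstream (both hypothetical):

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082/druid/v2/sql"  # assumed broker address

def build_sql_request(sql):
    """Package a Druid SQL query as the JSON body the broker's SQL endpoint expects."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        BROKER_URL, data=body, headers={"Content-Type": "application/json"})

def query_druid(sql):
    """Send the query and decode the JSON result rows (needs a live broker)."""
    with urllib.request.urlopen(build_sql_request(sql)) as resp:
        return json.load(resp)

# Last 10 minutes grouped by a dimension; the broker fans this out to
# real-time tasks and historicals, then merges the partial results.
SQL = """
SELECT country, SUM("count") AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
GROUP BY country
ORDER BY events DESC
"""
# rows = query_druid(SQL)  # uncomment against a live cluster
```

The broker decides from the WHERE clause's time range which segments are relevant, which is why time-bounded queries are so much cheaper than full scans.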
Storage Layer Details: Deep storage on S3 holds months or years of data, often 50 to 200 terabytes for large deployments; this is your source of truth. Historical nodes are stateless compute: they can be added or removed freely because they always reload segments from S3. A typical production setup keeps 7 to 30 days of hot data cached on historical SSDs (5 to 20 terabytes), with older data queryable but slower since it requires S3 reads. The metadata store, usually PostgreSQL or MySQL, tracks which segments exist, their time ranges, and which historicals currently serve them. If this database becomes unavailable, query serving continues but segment management stops, which is why you run the metadata store with high-availability replicas.
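The hot-data window is typically enforced through coordinator retention rules. A sketch of the rule chain as Python dicts mirroring Druid's JSON rule format (loadByPeriod/dropForever are real rule types; the tier name and defaults here are illustrative):

```python
def retention_rules(hot_days=30, replicas=2):
    """Build a Druid-style rule chain: keep the last `hot_days` of segments
    loaded on historicals (replicated for fault tolerance), and drop older
    segments from the cluster -- they remain in deep storage either way."""
    return [
        {"type": "loadByPeriod",
         "period": f"P{hot_days}D",
         "tieredReplicants": {"_default_tier": replicas}},
        {"type": "dropForever"},
    ]

print(retention_rules(7))
```

Rules are evaluated top to bottom per segment, so the load rule must precede the drop rule; dropped segments are unloaded from historicals but never deleted from S3.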
⚠️ Common Pitfall: Underestimating deep storage I/O. If your query pattern frequently touches cold data (older than what's cached on historicals), you'll see high S3 read costs and slower queries. Plan hot data retention based on actual query patterns, not just total data volume.
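One way to plan retention from actual query patterns rather than total volume: measure the lookback window of real queries and size the hot cache to cover a target percentile. A toy sketch (the sample distribution is made up):

```python
import math

def hot_retention_days(lookback_days, coverage=0.95):
    """Smallest hot-data retention (in days) such that `coverage` of the
    observed queries touch only segments cached on historicals."""
    ordered = sorted(lookback_days)
    idx = math.ceil(coverage * len(ordered)) - 1
    return ordered[idx]

# Hypothetical sample: most dashboards look back 1-14 days, a few go deep.
samples = [1] * 50 + [7] * 30 + [14] * 15 + [90] * 5
print(hot_retention_days(samples))  # 14 -> cache ~14 days, not 90
```

Here caching 14 days keeps 95% of queries off S3; the rare 90-day queries pay the cold-read penalty instead of forcing a 6x larger SSD footprint.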
Elastic Scaling: This separation of compute and storage enables true elasticity. Under a 10x query load spike, add more broker and historical nodes. They'll start serving immediately by loading segments from S3. Under a 10x ingest spike, increase indexing task parallelism and possibly add middle managers. The segments they produce go to the same shared S3 bucket, automatically discovered by coordinators. At companies like Airbnb, Druid clusters handle varying load by scaling historical nodes during peak hours (daytime dashboards) and scaling back at night, while maintaining constant ingest for 24/7 event streams.
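The day/night scaling pattern reduces to a simple target-capacity rule, sketched below (the per-node throughput and bounds are illustrative numbers, not from any real deployment):

```python
import math

def desired_historicals(current_qps, qps_per_node=500,
                        min_nodes=10, max_nodes=50):
    """Target historical count for the current query load, clamped between
    a floor (always-on serving for 24/7 ingest) and a ceiling (cost cap).
    Nodes added this way serve immediately by pulling segments from S3."""
    target = math.ceil(current_qps / qps_per_node)
    return max(min_nodes, min(target, max_nodes))

print(desired_historicals(2_000))   # quiet hours: stays at the floor of 10
print(desired_historicals(20_000))  # daytime spike: scales to 40
```

Because historicals are stateless, scale-in is equally cheap: removing a node only requires the coordinator to re-replicate its segments elsewhere, not a data migration.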
💡 Key Takeaways
Indexing tasks ingest from Kafka at millions of events per second, build in-memory indexes with rollup, and persist immutable segments to S3 every few minutes
Deep storage on S3 holds 50 to 200 terabytes durably, while historical nodes cache 7 to 30 days of hot segments on local SSDs for fast access
Separation of compute (stateless query nodes) and storage (durable S3 segments) enables elastic scaling: add nodes to handle load without data rebalancing
Coordinators manage segment assignment and replication using the metadata store, ensuring each segment is available on multiple historical nodes for fault tolerance
Typical data freshness is 10 to 60 seconds from event time to queryable, determined by how frequently indexing tasks persist segments
📌 Examples
1. Ad tech platform: 8 million events per second from Kafka, 50 indexing tasks building segments every 5 minutes, 40 historical nodes serving 15 days of hot data (12 TB cached on NVMe), 180 days in S3 (85 TB total)
2. Real-time fraud detection: events queryable within 15 seconds of occurrence, queries filter the last 10 minutes grouped by merchant, p50 latency 60 milliseconds across 200 concurrent dashboard users
3. E-commerce analytics: Black Friday traffic spikes 20x, cluster auto-scales from 10 to 50 historical nodes in 10 minutes, query latency stays under 200 milliseconds by loading segments from S3 in parallel