
What is Apache Druid?

Definition
Apache Druid is a real-time Online Analytical Processing (OLAP) database designed for subsecond analytical queries on streaming event data while it is still being ingested at massive scale.
The Problem It Solves: Traditional data warehouses like Snowflake batch-load data every few minutes or hours, which is too slow for real-time analytics. Streaming systems like Kafka handle ingest well but struggle with complex aggregations. You need something that can answer a question like "show me website clicks by country in the last 15 minutes" with subsecond latency while millions of new events pour in every second. This is the exact gap Druid fills.

Core Architecture: Druid models data as fact tables of events, each with a timestamp, dimensions (attributes like country or device type), and metrics (numbers like click count or revenue). It ingests streams continuously from sources like Kafka, builds columnar indexes in memory, and periodically writes immutable, compressed segments to cheap storage like S3. Here's what makes it different: queries run on these columnar segments, not directly on the stream. This separation allows you to scan billions of rows with subsecond latency. Data is partitioned by time (usually hourly or daily), so time-filtered queries only touch the relevant segments.
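For concreteness, here is a minimal sketch of that "clicks by country in the last 15 minutes" question as a Druid SQL query sent to Druid's SQL HTTP endpoint (POST /druid/v2/sql). The router address and the clickstream datasource name are assumptions for illustration.

```python
import requests

# Hypothetical Druid router address; 8888 is the default router port.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# __time is Druid's built-in timestamp column; "clickstream" is an
# assumed datasource name for this example.
query = """
SELECT country, COUNT(*) AS clicks
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE
GROUP BY country
ORDER BY clicks DESC
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=10)
resp.raise_for_status()
for row in resp.json():  # one JSON object per result row
    print(row["country"], row["clicks"])
```

Because the WHERE clause filters on __time, Druid only opens the segments covering the last 15 minutes instead of scanning the whole datasource.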
Typical Query Performance
P50 latency: 50-150 ms
P99 latency: <500 ms
Common Use Cases: Druid shines in scenarios that require both high ingest throughput and interactive queries: real-time fraud detection systems that analyze transaction patterns as they occur; ad campaign monitoring dashboards that show click-through rates updated every few seconds; game telemetry that aggregates player actions across millions of concurrent users; and operational business intelligence where executives need up-to-the-minute metrics on sales or user behavior.
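All of these use cases begin the same way: a streaming ingestion spec. The sketch below posts a Kafka supervisor spec to Druid's Overlord API (/druid/indexer/v1/supervisor); the topic, datasource, column names, and addresses are assumptions, and the rollup settings it enables are explained in the takeaways that follow.

```python
import requests

# A minimal Kafka ingestion supervisor spec. The topic, datasource,
# and column names are assumed for illustration.
spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "clicks",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "inputFormat": {"type": "json"},
            "useEarliestOffset": True,
        },
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "device_type"]},
            "metricsSpec": [
                {"type": "count", "name": "clicks"},
                {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"},
            ],
            "granularitySpec": {
                "segmentGranularity": "hour",  # one segment per hour of data
                "queryGranularity": "minute",  # rollup time bucket
                "rollup": True,
            },
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# 8081 is the default Overlord/Coordinator port; adjust for your cluster.
resp = requests.post("http://localhost:8081/druid/indexer/v1/supervisor",
                     json=spec, timeout=10)
resp.raise_for_status()
print(resp.json())  # on success, Druid echoes the supervisor id
```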
💡 Key Takeaways
Apache Druid is a real-time OLAP database that enables subsecond analytical queries on streaming data arriving at millions of events per second
Data is modeled as time-series events with dimensions (country, device) and metrics (clicks, revenue), stored in columnar format
Time-based partitioning by hour or day allows queries to prune irrelevant data, achieving p50 latencies of 50 to 150 milliseconds
Rollup at ingest can pre-aggregate events with identical dimensions and time buckets, reducing storage by 5x to 100x (see the sketch after this list)
Separates ingest (streaming from Kafka), storage (immutable segments on S3), and query serving (historical nodes) for elastic scaling
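To make the rollup takeaway concrete, here is a small, self-contained Python sketch of what rollup does conceptually at ingest: truncate each timestamp to the query granularity, then merge rows whose time bucket and dimensions match, summing the metrics. The event data is invented for illustration.

```python
from collections import defaultdict

# Invented raw events: (ISO timestamp, country, device, revenue).
events = [
    ("2024-01-01T10:00:03", "US", "mobile",  0.10),
    ("2024-01-01T10:00:41", "US", "mobile",  0.25),
    ("2024-01-01T10:00:59", "US", "mobile",  0.15),
    ("2024-01-01T10:00:12", "DE", "desktop", 0.40),
]

# Roll up to minute granularity: one stored row per
# (time bucket, country, device), with summed metrics.
rolled = defaultdict(lambda: {"clicks": 0, "revenue": 0.0})
for ts, country, device, revenue in events:
    bucket = ts[:16]  # truncate "YYYY-MM-DDTHH:MM:SS" to the minute
    row = rolled[(bucket, country, device)]
    row["clicks"] += 1
    row["revenue"] += revenue

for key, metrics in sorted(rolled.items()):
    print(key, metrics)
# Four raw events collapse into two stored rows; with many duplicate
# dimension combinations per bucket, the reduction reaches 5x to 100x.
```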
📌 Examples
1. Ad tech platform ingesting 5 to 10 million events per second, maintaining 30 to 90 days of hot data, powering hundreds of dashboards with subsecond query response
2. Fraud detection system analyzing transaction patterns in real time, with queries like "show suspicious patterns in the last 10 minutes, grouped by merchant and card type" (sketched after this list)
3. Game telemetry aggregating player actions across millions of concurrent users, with dashboard queries completing in under 200 milliseconds
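Example 2's natural-language question maps directly onto Druid SQL. Here is a sketch reusing the HTTP pattern from earlier; the transactions datasource, column names, and the "suspicious" threshold are assumptions.

```python
import requests

# Assumed datasource and columns; the HAVING threshold is a
# hypothetical stand-in for a real fraud heuristic.
query = """
SELECT merchant, card_type,
       COUNT(*)    AS txn_count,
       SUM(amount) AS total_amount
FROM transactions
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
GROUP BY merchant, card_type
HAVING COUNT(*) > 100
ORDER BY txn_count DESC
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": query}, timeout=10)
resp.raise_for_status()
print(resp.json())
```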