What is Pre-aggregation?

Definition
Pre-aggregation means computing summary metrics ahead of time rather than calculating them for every query. Instead of scanning billions of raw events each time someone loads a dashboard, you maintain compact tables of already computed sums, counts, and averages.
The Core Problem:
Imagine you have 50 billion raw click events in your data warehouse. Every time an executive opens a dashboard asking "how many daily active users by country?", the system must scan those 50 billion rows, filter, group, and count. Even with parallel processing, this might take 30 seconds and consume hundreds of CPU cores. If 100 people view that dashboard daily, you're burning massive compute resources on the same calculation over and over.

How Pre-aggregation Solves This:
Instead of querying raw events repeatedly, you run a background job that computes the answer once and stores it in a much smaller table. You scan the 50 billion raw events once, group by country and date, and produce a far smaller set of aggregate rows. With 200 countries, 365 days, and, say, 3 years of history, that is roughly 219,000 rows; even with a few additional grouping dimensions, the table would typically stay within about 10 million rows. Now when someone opens the dashboard, the query hits this small pre-aggregated table instead, scanning at most 10 million rows rather than 50 billion.
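The background job described above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database, not a production pipeline; the table names (raw_events, dau_by_country), the column names, and the dashboard_query helper are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema: one row per raw click event.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_events (event_date TEXT, country TEXT, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("2024-01-01", "US", 1),
        ("2024-01-01", "US", 1),  # same user clicking again on the same day
        ("2024-01-01", "US", 2),
        ("2024-01-01", "DE", 3),
        ("2024-01-02", "US", 1),
    ],
)

# Background job: scan the raw events once and keep one row per (date, country).
conn.execute(
    """
    CREATE TABLE dau_by_country AS
    SELECT event_date, country, COUNT(DISTINCT user_id) AS daily_active_users
    FROM raw_events
    GROUP BY event_date, country
    """
)


def dashboard_query(day):
    """Answer the dashboard from the small aggregate table, not from raw_events."""
    rows = conn.execute(
        "SELECT country, daily_active_users FROM dau_by_country "
        "WHERE event_date = ? ORDER BY country",
        (day,),
    ).fetchall()
    return dict(rows)
```

Calling dashboard_query("2024-01-01") reads only the aggregate rows for that day. Note that distinct-user counts stored per day cannot simply be added across days, because the same user may be active on several days.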

Query Performance Impact: raw data 30 sec → pre-aggregated 200 ms
The Fundamental Trade-off:
You trade storage space and update complexity for query speed. The pre-aggregated table takes extra disk space, and you need pipeline logic to keep it updated as new events arrive. But for queries that hit the same aggregations repeatedly, the performance gain is dramatic: 150x faster in this example (30 seconds down to 200 ms).
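The "update complexity" side of the trade-off can be made concrete with a small sketch of incremental maintenance. It assumes events arrive in batches of (date, country, value) tuples, where value is a hypothetical numeric metric; the names aggregates, apply_batch, and average are illustrative only. One common design choice is shown: store the sum and the count rather than a finished average, so new events can be merged without rereading old ones.

```python
from collections import defaultdict

# Aggregate state keyed by (date, country). Keeping sum and count separately
# lets a new batch be folded in without touching previously processed events.
aggregates = defaultdict(lambda: {"count": 0, "total": 0.0})


def apply_batch(events):
    """Fold a batch of new (date, country, value) events into the aggregates."""
    for date, country, value in events:
        cell = aggregates[(date, country)]
        cell["count"] += 1
        cell["total"] += value


def average(date, country):
    """Derive the average at query time from the stored sum and count."""
    cell = aggregates.get((date, country))
    if cell is None or cell["count"] == 0:
        return None
    return cell["total"] / cell["count"]
```

A real pipeline would also have to handle late or duplicate events, which is where much of the extra complexity tends to come from.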

💡 Key Takeaways

✓ Pre-aggregation computes summary metrics once and stores them, avoiding repeated expensive calculations over raw data

✓ Reduces data volume scanned per query by 100x to 1000x or more (50 billion rows down to 10 million is 5,000x), turning 30-second queries into 200-millisecond responses

✓ Primary trade-off is extra storage and pipeline complexity for dramatically faster query performance

✓ Most valuable when the same aggregations are queried repeatedly, such as dashboards viewed hundreds of times daily

📌 Interview Tips

1. A consumer app logs 5 million events per second. Querying raw data for daily active users takes 30 seconds. Pre-aggregating by country and day reduces the data scanned to about 10 million rows, enabling 200 ms query times.

2. An e-commerce site maintains hourly aggregates of revenue by product category. Instead of scanning 2 billion transaction records, queries hit a table with 50,000 rows (100 categories times 500 hours).
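The hourly revenue aggregate from the second tip could look roughly like the sketch below. The record layout (timestamp, category, amount), the sample data, and the hourly_revenue function are assumptions for illustration; floats are used for brevity, whereas a real system would more likely store money as integer cents or decimals.

```python
from collections import defaultdict
from datetime import datetime


def hourly_revenue(transactions):
    """Sum revenue per (hour, category) from (timestamp, category, amount) records."""
    totals = defaultdict(float)
    for ts, category, amount in transactions:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[(hour, category)] += amount
    return dict(totals)


# Hypothetical sample transactions.
transactions = [
    (datetime(2024, 1, 1, 9, 5), "books", 12.50),
    (datetime(2024, 1, 1, 9, 40), "books", 7.50),
    (datetime(2024, 1, 1, 9, 59), "toys", 30.00),
    (datetime(2024, 1, 1, 10, 1), "books", 5.00),
]
hourly = hourly_revenue(transactions)

# The aggregate size is bounded by categories times hours, not by transaction volume.
max_rows = 100 * 500
```

However many transactions fall into an hour, each (hour, category) pair produces a single row, which is why the table in the tip is capped at 100 times 500 = 50,000 rows.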
