Real-time Analytics & OLAP • Pre-aggregation & Rollup Patterns
Production Scale Reality
The Scale Challenge:
At production scale, pre-aggregation and rollup patterns become essential infrastructure, not an optional optimization. Consider a large consumer platform like Meta with 1 billion monthly active users. The logging pipeline ingests 5 million events per second at peak, generating roughly 400 billion events per day. Storing this raw data in columnar format in object storage is feasible, but querying it directly for dashboards is not.
A single complex analytical query scanning even 1 percent of this data might take 10 to 30 seconds and consume hundreds of CPU cores. If you have 50 dashboards refreshed every 5 minutes by various teams, that's 14,400 dashboard loads per day. Running each query against raw data would require thousands of cores running continuously just to serve dashboards, costing millions of dollars per year in compute alone.
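The dashboard arithmetic above can be sanity-checked with a quick back-of-envelope sketch. The dashboard count, refresh interval, and query duration come from the text; the 300-cores-per-query figure is an illustrative assumption standing in for "hundreds of CPU cores":

```python
# Back-of-envelope check of the dashboard load math above.
dashboards = 50
refresh_interval_min = 5
refreshes_per_day = 24 * 60 // refresh_interval_min  # 288 refreshes/day
loads_per_day = dashboards * refreshes_per_day
print(loads_per_day)  # 14400 dashboard loads per day

# If each load scans raw data for ~20 s on ~300 cores (assumed):
query_seconds = 20
cores_per_query = 300
core_seconds_per_day = loads_per_day * query_seconds * cores_per_query

# Average number of cores kept busy just serving dashboards:
avg_cores = core_seconds_per_day / 86_400
print(round(avg_cores))  # ~1000 cores running continuously
```

A thousand cores running around the clock, every day, for dashboards alone is how the "millions of dollars per year in compute" figure arises.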
The Aggregation Layer:
Instead, production systems insert an aggregation layer between raw storage and analytical consumers. Batch or streaming jobs continuously compress raw events into metric tables. For example, you might aggregate per user, per app version, per country, per hour. With 1 billion users, 100 app versions, 200 countries, and 24 hours, that's approximately 480 trillion combinations theoretically. In practice, you aggregate only active combinations, reducing data volume by 100x to 1000x.
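A rollup job of this kind can be sketched in a few lines. This is a minimal in-memory version; the event fields (`user_id`, `app_version`, `country`, `ts`) and metric names are illustrative assumptions, not a real pipeline's schema:

```python
from collections import defaultdict

def rollup(events):
    """Compress raw events into an hourly metric table
    keyed by (app_version, country, hour)."""
    table = defaultdict(lambda: {"events": 0, "users": set()})
    for e in events:
        key = (e["app_version"], e["country"], e["ts"] // 3600)
        cell = table[key]
        cell["events"] += 1
        cell["users"].add(e["user_id"])
    # Finalize: replace user sets with distinct counts.
    return {k: {"events": v["events"], "distinct_users": len(v["users"])}
            for k, v in table.items()}

events = [
    {"user_id": 1, "app_version": "2.1", "country": "US", "ts": 3600},
    {"user_id": 2, "app_version": "2.1", "country": "US", "ts": 3700},
    {"user_id": 1, "app_version": "2.1", "country": "US", "ts": 3800},
]
agg = rollup(events)
print(agg[("2.1", "US", 1)])  # {'events': 3, 'distinct_users': 2}
```

Note that exact distinct counts (the `set` here) do not scale to billions of users; production rollups typically store approximate sketches such as HyperLogLog so that partial aggregates can be merged across partitions.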
Now the online analytics service powering dashboards hits these pre-aggregated tables. With reduced data size and better partitioning, queries run in 50 to 300 milliseconds at p50 and under 1 second at p99, even at several thousand queries per second. This is how internal BI tools and executive dashboards achieve interactive performance on petabyte scale datasets.
Hybrid Architectures:
Large systems often use hybrid approaches. Streaming aggregation maintains metrics for the last 24 to 48 hours in a low latency store with 10 to 60 second freshness. Batch pipelines periodically compact these into longer term tables optimized for historical queries. This splits the workload: streaming handles operational monitoring where freshness matters, batch handles heavy analytical queries where 1 to 24 hour lag is acceptable.
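The split described above implies a routing decision on every query: recent time ranges go to the streaming store, older ranges to batch tables, and ranges spanning the boundary must merge both. A minimal sketch, where the store names and the 48-hour cutoff mirror the text but the routing function itself is a hypothetical stand-in:

```python
import time

STREAMING_WINDOW_S = 48 * 3600  # hot window held in the low-latency store

def route_query(start_ts, end_ts, now=None):
    """Decide which store(s) a [start_ts, end_ts] query must hit."""
    now = now if now is not None else time.time()
    cutoff = now - STREAMING_WINDOW_S
    if start_ts >= cutoff:
        return ["streaming"]           # fully inside the hot window
    if end_ts < cutoff:
        return ["batch"]               # fully historical
    return ["batch", "streaming"]      # spans the boundary: merge both

now = 1_000_000_000
print(route_query(now - 3600, now, now))                    # ['streaming']
print(route_query(now - 90 * 3600, now - 80 * 3600, now))   # ['batch']
```

The merge case is where most of the engineering effort goes in practice: the two stores must agree on metric definitions and deduplicate the overlap window, or dashboards show seams at the boundary.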
Production Dashboard Performance
- P50 latency: 50 ms
- P99 latency: 1 second
- Throughput: 3,000 QPS
⚠️ Common Pitfall: Over-aggregating can create more problems than it solves. A full cube across 10 high-cardinality dimensions might be several times larger than the raw data and take days to rebuild, while only improving 20 percent of queries.
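The reason a full cube backfires is simple multiplication: the theoretical cell count is the product of the dimension cardinalities. The cardinalities below are illustrative assumptions, not figures from any real system:

```python
import math

# Hypothetical cardinalities for 10 dimensions in a full cube.
cardinalities = [1000, 500, 200, 100, 100, 50, 50, 24, 10, 10]

# Theoretical cell count is the product of the cardinalities.
cells = math.prod(cardinalities)
print(f"{cells:.1e}")  # 6.0e+18 theoretical cells
```

Even with only a tiny fraction of cells populated, a cube this wide can dwarf the raw event count, which is why production rollups aggregate a handful of curated dimension combinations rather than the full cube.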
Cost Impact:
The economics are compelling. At Meta scale, pre-aggregation reduces compute cost for analytical queries by 90 to 95 percent compared to always querying raw data. Storage costs increase by perhaps 30 to 50 percent, but compute is typically 5 to 10 times more expensive per unit of work, making this a clear win. The system complexity is the real cost: managing aggregation pipelines, ensuring correctness, and handling backfills requires significant engineering investment.
💡 Key Takeaways
✓At Meta scale with 5 million events per second, querying raw data takes 10 to 30 seconds per query and requires thousands of CPU cores continuously
✓Pre-aggregation reduces data volume by 100x to 1000x, enabling 50ms p50 and 1 second p99 latency at 3000+ QPS for dashboards
✓Hybrid architectures use streaming for recent data with 10 to 60 second freshness, batch for historical data with 1 to 24 hour lag
✓Reduces analytical compute cost by 90 to 95 percent while increasing storage by 30 to 50 percent, a clear economic win
📌 Examples
1. Meta's internal BI tools serve thousands of dashboards hitting pre-aggregated tables at 3000 QPS with sub-second latency, which would be impossible with raw data queries
2. A large consumer app maintains streaming aggregates for the last 48 hours (operational monitoring) and batch aggregates for historical analysis (finance reporting), splitting workloads by freshness requirements