Data Warehousing Fundamentals • Materialized Views & Aggregations
Layered Aggregation Architecture
The Real-World Pipeline: In production systems handling high event volumes, materialized views are one part of a multi-layered architecture that sits between raw data ingestion and serving.
Consider a clickstream analytics pipeline for a large consumer app generating around 1 million events per second globally. Events flow into Kafka or Pub/Sub, land in a data lake such as S3, and are then ingested into a warehouse like BigQuery or Snowflake. The raw events table grows by multiple terabytes daily, adding tens of billions of rows.
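To get a feel for these numbers, a quick back-of-the-envelope calculation helps. The sketch below treats 1 million events per second as a peak rate and assumes roughly 100 bytes per stored event; both figures are illustrative assumptions, not measurements from any specific system.

```python
# Back-of-envelope sizing for the raw events table.
# Assumptions (illustrative only): 1M events/sec at peak, ~100 bytes per stored event.
PEAK_EVENTS_PER_SEC = 1_000_000
BYTES_PER_EVENT = 100  # assumed average stored row size

events_per_5_min = PEAK_EVENTS_PER_SEC * 5 * 60         # 300 million events per 5-minute window
events_per_day_at_peak = PEAK_EVENTS_PER_SEC * 86_400   # ~86B rows if the peak held all day
daily_bytes_at_peak = events_per_day_at_peak * BYTES_PER_EVENT

print(f"Events per 5-minute window:  {events_per_5_min:,}")
print(f"Daily rows (upper bound):    {events_per_day_at_peak:,}")
print(f"Daily storage (upper bound): {daily_bytes_at_peak / 1e12:.1f} TB")
```

Real traffic is bursty rather than flat at the peak, so sustained daily totals come in below these upper bounds, in line with the "tens of billions of rows" and "multiple terabytes" figures above.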
Why Layers Matter: You cannot jump directly from raw events to final dashboard queries at this scale. The optimization happens in stages, with each layer reducing data volume and query complexity before the next one runs.
Platform-Specific Approaches: Different vendors implement view maintenance differently. BigQuery's incremental materialized views recompute only the partitions that changed, which is effective when the base table is partitioned by date and holds tens of terabytes. Snowflake uses change data capture via streams to apply only new or modified rows. ClickHouse runs materialized views as insert triggers, updating aggregate tables synchronously or asynchronously on every insert.
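As a rough illustration, here is what the first aggregation layer's definitions might look like on two of these platforms, held as SQL strings. The dataset, table, and column names (analytics.raw_events, events_5min, event_time, country, app_version) are placeholder assumptions, and partitioning, clustering, and target-table engine definitions are omitted; treat these as simplified sketches rather than production DDL.

```python
# Illustrative first-layer aggregation DDL, kept as SQL strings.
# All object and column names below are placeholder assumptions.

# BigQuery: a materialized view over the raw table; BigQuery maintains it
# incrementally, recomputing only partitions whose source data changed.
BIGQUERY_HOURLY_MV = """
CREATE MATERIALIZED VIEW analytics.events_hourly_mv AS
SELECT
  TIMESTAMP_TRUNC(event_time, HOUR) AS hour,
  country,
  app_version,
  COUNT(*) AS events
FROM analytics.raw_events
GROUP BY hour, country, app_version
"""

# ClickHouse: a materialized view acting as an insert trigger that writes
# 5-minute pre-aggregates into a separate target table on every insert.
CLICKHOUSE_5MIN_MV = """
CREATE MATERIALIZED VIEW events_5min_mv TO events_5min AS
SELECT
  toStartOfInterval(event_time, INTERVAL 5 MINUTE) AS bucket,
  country,
  app_version,
  count() AS events
FROM raw_events
GROUP BY bucket, country, app_version
"""
```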
Smart Read Paths: Production queries often combine layers. A dashboard covering 30 days might read daily aggregates for the first 29 days and hourly aggregates or even raw data for the current incomplete day, balancing freshness with performance.
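A minimal sketch of such a read path follows, assuming hypothetical daily_agg and hourly_agg tables. It uses a plain event count as the metric because counts sum cleanly across hours; distinct-user metrics like DAU need extra machinery and are left out here.

```python
from datetime import date, timedelta

def build_events_by_country_query(today: date, window_days: int = 30) -> str:
    """Read closed days from daily aggregates and the current, incomplete day
    from hourly aggregates.

    Assumes hypothetical tables daily_agg(day, country, events) and
    hourly_agg(hour, country, events); real schemas will differ.
    """
    start = today - timedelta(days=window_days - 1)
    last_closed_day = today - timedelta(days=1)
    return f"""
SELECT day, country, events
FROM daily_agg
WHERE day BETWEEN '{start.isoformat()}' AND '{last_closed_day.isoformat()}'

UNION ALL

-- The current day is not finalized, so serve it from the fresher hourly layer.
SELECT DATE(hour) AS day, country, SUM(events) AS events
FROM hourly_agg
WHERE DATE(hour) = '{today.isoformat()}'
GROUP BY day, country
"""

# Example: the 30-day dashboard query issued today.
print(build_events_by_country_query(date.today()))
```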
1. Raw Events → 5-Minute Aggregates: A streaming job or warehouse-native feature computes per-5-minute aggregates per dimension (user, country, app version). This might reduce 300 million raw events per 5 minutes down to 5 million aggregate rows.
2. Intermediate Rollups: Those 5-minute aggregates roll up into hourly tables, collapsing 12 five-minute buckets into 1 hourly summary. Volume drops another 10x to 12x.
3. Daily Aggregates: Hourly data rolls into daily tables. A dashboard query for "last 30 days by country" now scans roughly 30 rows per country instead of billions of raw events; a runnable sketch of this rollup chain appears after the volume summary below.
Data Volume Reduction: RAW EVENTS (~30B rows) → 5-MIN AGGS (~5M rows) → DAILY SUMMARY (~30 rows)
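The chain itself is simple to express. Below is a minimal pandas sketch of the three rollup stages, assuming an illustrative raw schema with event_time, user_id, and country columns; a real pipeline would run these steps as streaming jobs or warehouse SQL over billions of rows rather than in memory.

```python
import pandas as pd

# Assumed raw event schema for illustration: one row per event.
raw = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-05-01 00:01:10", "2024-05-01 00:03:45",
                                  "2024-05-01 00:07:02", "2024-05-01 01:15:30"]),
    "user_id": [1, 2, 1, 3],
    "country": ["US", "US", "DE", "US"],
})

# Layer 1: 5-minute aggregates per dimension.
agg_5min = (
    raw.assign(bucket=raw["event_time"].dt.floor("5min"))
       .groupby(["bucket", "country"], as_index=False)
       .agg(events=("user_id", "size"))
)

# Layer 2: roll 12 five-minute buckets into one hourly summary.
agg_hourly = (
    agg_5min.assign(hour=agg_5min["bucket"].dt.floor("h"))
            .groupby(["hour", "country"], as_index=False)
            .agg(events=("events", "sum"))
)

# Layer 3: roll 24 hourly rows into one daily summary per country.
agg_daily = (
    agg_hourly.assign(day=agg_hourly["hour"].dt.floor("D"))
              .groupby(["day", "country"], as_index=False)
              .agg(events=("events", "sum"))
)

print(agg_daily)  # a handful of rows per country per day instead of raw events
```

Event counts are additive, so each layer can be built purely from the one below it; distinct-user metrics would instead require sketches or a pass over finer-grained data.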
💡 Key Takeaways
✓ Production systems use multiple aggregation layers: raw events → 5-minute aggregates → hourly rollups → daily summaries, each layer reducing volume by 10x to 1000x
✓ A typical pipeline processing 1 million events per second reduces 30 billion raw events down to roughly 30 rows for a 30-day dashboard query
✓ Platform implementations differ: BigQuery uses incremental partition refresh, Snowflake uses change data capture streams, ClickHouse uses insert triggers
✓ Smart read paths combine layers: query daily aggregates for historical data but hourly or raw data for the current incomplete day to balance freshness and speed
📌 Examples
1. Consumer app with 1M events/sec creates 5-minute aggregates per user and country (5M rows per 5 minutes), then hourly rollups (reducing 12 buckets to 1), then daily summaries for dashboard queries
2. Dashboard query "DAU by country, last 30 days" reads 29 days from the daily aggregates table plus the current day from the hourly table, achieving sub-100ms p95 latency