Refresh Strategies and Maintenance Patterns
The Maintenance Challenge: Keeping materialized views in sync with base data is the hard part. There are three main strategies, each with different consistency guarantees and operational complexity.
Native Materialized View Support: Data warehouses like BigQuery and Snowflake track changes to base tables and automatically update materialized views. BigQuery's incremental materialized views recompute only the partitions where data changed, which works well when you partition by ingestion time or date. For a table with 500 partitions (roughly 500 days of data), typically only 1 or 2 partitions need a refresh per update cycle.
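As a rough sketch of this setup, the view below could be created through the google-cloud-bigquery client; the dataset, table, and column names are hypothetical, and the refresh options are illustrative rather than recommendations.

```python
# Hypothetical example: an incremental materialized view over a date-partitioned
# events table. BigQuery maintains it in the background, reprocessing only the
# partitions whose data has changed since the last refresh.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW IF NOT EXISTS mydataset.daily_revenue_mv
OPTIONS (enable_refresh = true, refresh_interval_minutes = 30)
AS
SELECT
  DATE(event_ts) AS event_date,
  SUM(amount)    AS revenue,
  COUNT(*)       AS order_count
FROM mydataset.events
GROUP BY event_date
"""
client.query(ddl).result()  # wait for the DDL job to complete
```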
Snowflake exposes streams for change data capture: a stream records every insert, update, and delete on a table since it was last consumed. The refresh job consumes the stream and applies only those changes to the aggregate. This is efficient but requires careful handling of exactly-once semantics to avoid double counting.
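A minimal sketch of that flow using the snowflake-connector-python package; the table, stream, and column names are hypothetical, the base table is assumed to be append-only, and in practice the MERGE would run from a scheduled task.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# One-time setup: a stream that records changes to the base table.
cur.execute("CREATE STREAM IF NOT EXISTS orders_stream ON TABLE orders")

# Periodic refresh: consuming the stream inside a committed DML statement
# advances its offset, so each change is applied once as long as the MERGE
# succeeds. Assumes an append-only base table (only INSERT actions).
cur.execute("""
    MERGE INTO daily_revenue AS t
    USING (
        SELECT DATE(order_ts) AS order_date,
               SUM(amount)    AS revenue,
               COUNT(*)       AS order_count
        FROM orders_stream
        WHERE METADATA$ACTION = 'INSERT'
        GROUP BY 1
    ) AS s
    ON t.order_date = s.order_date
    WHEN MATCHED THEN UPDATE SET
        revenue = t.revenue + s.revenue,
        order_count = t.order_count + s.order_count
    WHEN NOT MATCHED THEN INSERT (order_date, revenue, order_count)
        VALUES (s.order_date, s.revenue, s.order_count)
""")
cur.close()
conn.close()
```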
The advantage is convenience: the platform handles the mechanics. The disadvantage is less transparency and potential limitations on query complexity or on which aggregations are supported.
Streaming Aggregation Jobs: Systems like Kafka Streams, Flink, or Spark Structured Streaming consume event streams and continuously update aggregation tables or key-value stores. This gives very low latency, often end-to-end freshness within seconds.
The challenge is correctly handling state management, late-arriving data, and exactly-once or idempotent writes. With at-least-once delivery, you must ensure aggregation updates are idempotent or use transactional writes to avoid double counting.
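A minimal PySpark Structured Streaming sketch of this pattern (Spark with its Kafka connector assumed; the topic, schema, and paths are hypothetical). The watermark bounds how long state is kept for late events, the checkpoint lets the job recover its state, and the console sink is a placeholder for an idempotent upsert into a real serving store.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("revenue-stream").getOrCreate()

# Read raw order events from Kafka and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "order_id STRING, amount DOUBLE, event_ts TIMESTAMP").alias("e"))
    .select("e.*")
)

# Windowed aggregation: the watermark lets events up to 1 hour late be folded
# into the correct 5-minute window before that window's state is dropped.
agg = (
    events.withWatermark("event_ts", "1 hour")
    .groupBy(F.window("event_ts", "5 minutes"))
    .agg(F.sum("amount").alias("revenue"),
         F.count(F.lit(1)).alias("order_count"))
)

# The checkpoint stores offsets and state so the job can restart without losing
# or recounting data; a production sink would upsert keyed on the window so that
# replayed micro-batches overwrite rather than double count.
query = (
    agg.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/revenue")
    .start()
)
query.awaitTermination()
```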
Batch Aggregation Jobs: A scheduled job recomputes aggregates for recent windows every 5 to 15 minutes. For large tables, combine this with an incremental merge approach: recompute only the last N hours or days each run and leave older partitions untouched.
For example, a daily batch job might recompute aggregates for the past 7 days and merge them into the main table. This handles late arrivals of up to 7 days and keeps most historical data stable. The trade-off is higher latency (minutes to tens of minutes) but a simpler operational model.
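A sketch of that incremental merge as a scheduled job, reusing hypothetical BigQuery table names; because it recomputes complete per-day aggregates for the trailing window rather than applying deltas, a rerun of the same window is naturally idempotent.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Recompute the trailing 7 days in full and merge them into the main aggregate
# table; partitions older than the window are never touched.
merge_sql = """
MERGE analytics.daily_revenue AS t
USING (
    SELECT DATE(event_ts) AS event_date,
           SUM(amount)    AS revenue,
           COUNT(*)       AS order_count
    FROM analytics.events
    WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- prunes partitions
    GROUP BY event_date
) AS s
ON t.event_date = s.event_date
WHEN MATCHED THEN
    UPDATE SET revenue = s.revenue, order_count = s.order_count
WHEN NOT MATCHED THEN
    INSERT (event_date, revenue, order_count)
    VALUES (s.event_date, s.revenue, s.order_count)
"""
client.query(merge_sql).result()
```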
A typical streaming aggregation pipeline has three steps (sketched in code after this list):
1. Consume events from Kafka: Read events with exactly-once or at-least-once semantics.
2. Maintain stateful aggregates: Use windowed operations with state backends like RocksDB.
3. Write to the serving store: Emit results to a database, key-value store, or warehouse table.
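A minimal sketch of those three steps with kafka-python and psycopg2 (both assumed, as are the topic, consumer group, and table names). It skips a local state backend and instead relies on an idempotent per-event insert, so at-least-once redelivery cannot double count; aggregates are then served by a simple GROUP BY over the serving table.

```python
import json

import psycopg2
from kafka import KafkaConsumer

# 1. Consume events: offsets are committed only after the write has landed,
#    giving at-least-once delivery.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="broker:9092",
    group_id="revenue-aggregator",
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
conn = psycopg2.connect("dbname=serving user=etl")

for record in consumer:
    event = record.value
    # 2./3. Real engines keep windowed state in a backend such as RocksDB; this
    # sketch instead writes each event to the serving store, where the primary
    # key on order_id turns redelivered events into no-ops. Daily revenue is then
    # served with: SELECT order_date, SUM(amount) FROM processed_orders GROUP BY 1.
    with conn, conn.cursor() as cur:  # connection context manager commits the transaction
        cur.execute(
            """
            INSERT INTO processed_orders (order_id, order_date, amount)
            VALUES (%s, %s, %s)
            ON CONFLICT (order_id) DO NOTHING
            """,
            (event["order_id"], event["order_date"], event["amount"]),
        )
    consumer.commit()  # per-event for clarity; real jobs commit in batches
```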
✓ In Practice: Large companies like Google or Meta often combine streaming for low-latency critical metrics with batch for comprehensive historical rollups, choosing the right tool per use case.
Consistency and Serving: To avoid readers seeing partially updated data, use atomic swaps. Write new aggregates to a staging table or partition, then atomically swap table pointers or partition metadata. Readers always see a consistent snapshot, never a mix of old and new data.
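A minimal sketch of such a swap with psycopg2 against Postgres, where DDL is transactional; the table names are hypothetical, and the staging table is assumed to have been fully rebuilt by the refresh job.

```python
import psycopg2

conn = psycopg2.connect("dbname=serving user=etl")

# Both renames happen in one transaction, so readers see either the old table
# or the new one, never a mix of the two.
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS daily_revenue_old")
    cur.execute("ALTER TABLE daily_revenue RENAME TO daily_revenue_old")
    cur.execute("ALTER TABLE daily_revenue_staging RENAME TO daily_revenue")
# The transaction commits here; the swap becomes visible in a single instant.
```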
Versioned tables are another pattern: readers always query the latest stable version while refresh jobs build the next version in the background. Once the build completes, you promote the new version atomically.
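A sketch of version promotion in the same hypothetical Postgres setup: the next version is built while readers keep querying the stable view, and repointing the view is a single atomic statement (Postgres requires the new version to keep the same column names and types).

```python
import psycopg2

conn = psycopg2.connect("dbname=serving user=etl")
next_version = 43  # in practice tracked in refresh-job metadata

# Build the next version in the background; readers stay on daily_revenue_current.
with conn, conn.cursor() as cur:
    cur.execute(f"""
        CREATE TABLE daily_revenue_v{next_version} AS
        SELECT order_date, SUM(amount) AS revenue
        FROM processed_orders
        GROUP BY order_date
    """)

# Promote it atomically once the build has finished.
with conn, conn.cursor() as cur:
    cur.execute(
        f"CREATE OR REPLACE VIEW daily_revenue_current AS "
        f"SELECT * FROM daily_revenue_v{next_version}"
    )
```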
💡 Key Takeaways
✓ Three refresh strategies: native platform support (BigQuery incremental materialized views, Snowflake streams), streaming jobs (Flink, Kafka Streams, with seconds of latency), and batch jobs (scheduled recomputes every 5 to 15 minutes)
✓ BigQuery refreshes only changed partitions: for 500 daily partitions, typically only 1 or 2 need updates per cycle, which keeps incremental refresh efficient at large scale
✓ Streaming aggregations provide seconds of end-to-end latency but require careful state management and exactly-once or idempotent semantics to avoid double counting with at-least-once delivery
✓ Atomic swaps prevent partial read visibility: write to a staging table, then atomically swap pointers so readers always see complete, consistent snapshots, never mixed old and new data
📌 Examples
1. BigQuery materialized view partitioned by date refreshes only the yesterday and today partitions when new data arrives, leaving the 498 older partitions untouched
2. Flink streaming job consumes Kafka events, maintains windowed state in RocksDB, and writes 5-minute aggregates to Postgres with transactional inserts to guarantee exactly-once semantics