Real-time Analytics & OLAP: Pre-aggregation & Rollup Patterns

Failure Modes and Edge Cases

Stale and Inconsistent Aggregates: The most common failure mode in pre-aggregation systems is serving stale or incomplete data without any clear indication to users. If your upstream Extract, Transform, Load (ETL) job fails or lags, dashboards continue returning results that may be missing hours or days of recent data. Users see "success" responses and assume the data is current. For example, a daily finance rollup misses one hour of transactions due to a late-arriving partition from a third-party system; reported revenue is off by 3 percent, but the dashboard shows no error, and the business makes decisions on bad numbers. Production systems need explicit completeness checks: comparing aggregate row counts to raw fact control totals, verifying time-window coverage, and surfacing freshness metadata such as "last updated 2 hours ago" prominently in dashboards.
❗ Remember: Aggregates fail silently. A query against incomplete aggregates returns partial results with no error. Always instrument completeness checks and expose freshness to users.
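A minimal sketch of such a completeness check, assuming you can query a raw-fact row count and the aggregate's load timestamp; the function and field names here are illustrative, not from any specific system:

```python
from datetime import datetime, timedelta, timezone

def check_aggregate_completeness(raw_count: int,
                                 aggregated_count: int,
                                 last_loaded_at: datetime,
                                 max_staleness: timedelta,
                                 tolerance: float = 0.001) -> dict:
    """Return a freshness/completeness verdict for a rollup table."""
    now = datetime.now(timezone.utc)
    staleness = now - last_loaded_at
    # Row-count control total: the aggregate must account for (nearly)
    # every raw fact row; a shortfall suggests a dropped or late partition.
    missing_fraction = (raw_count - aggregated_count) / raw_count if raw_count else 0.0
    return {
        "complete": missing_fraction <= tolerance,
        "missing_fraction": missing_fraction,
        "stale": staleness > max_staleness,
        # Freshness metadata to surface in the dashboard, not hide in logs.
        "freshness_label": f"last updated {staleness.total_seconds() / 3600:.1f} hours ago",
    }

# One hour of a daily rollup is missing: 24M raw rows, only 23M accounted for.
verdict = check_aggregate_completeness(
    raw_count=24_000_000,
    aggregated_count=23_000_000,
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=2),
    max_staleness=timedelta(hours=1),
)
print(verdict["complete"], verdict["stale"])  # False True -- gap and staleness both flagged
```

The key design point is that the check fails loudly: the dashboard can render the gap and the freshness label instead of silently showing a plausible but wrong number.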
Double Counting During Backfills: Reprocessing historical data creates subtle correctness risks. Suppose you maintain hourly aggregates but discover a bug that underreported metrics for the last 30 days. You need to backfill by replaying events. If you run the aggregation pipeline again without deleting the old aggregates first, you will double-count unless the pipeline is explicitly idempotent. Idempotent aggregation means applying the same input multiple times produces the same output. This requires either a delete-then-insert pattern (remove old aggregates for the time range, then recompute) or upsert logic that can detect and replace existing rows. Without this, backfills silently corrupt your data. Partial failures during backfill are equally dangerous: if the backfill succeeds for 20 of 30 days, you now have 20 days of corrected data and 10 days of old, incorrect data, creating discontinuities in the time series that are hard to debug.
1. Detect Need for Backfill: bug discovered in aggregation logic affecting last 30 days of metrics
2. Delete Old Aggregates: remove rows for affected time range from aggregate tables
3. Replay Events: reprocess raw events through corrected aggregation pipeline
4. Validate Completeness: compare new aggregate totals to raw fact control totals
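The delete-then-insert pattern behind these steps can be sketched as follows, with an in-memory SQLite database standing in for a warehouse; table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_events (day TEXT, amount REAL);
    CREATE TABLE agg_daily (day TEXT, total REAL);
""")
conn.executemany("INSERT INTO fact_events VALUES (?, ?)",
                 [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.0)])

def backfill(conn, start_day, end_day):
    """Recompute daily aggregates for [start_day, end_day], safely re-runnable."""
    with conn:  # one transaction: the delete and insert commit (or roll back) together
        conn.execute("DELETE FROM agg_daily WHERE day BETWEEN ? AND ?",
                     (start_day, end_day))
        conn.execute("""
            INSERT INTO agg_daily
            SELECT day, SUM(amount) FROM fact_events
            WHERE day BETWEEN ? AND ? GROUP BY day
        """, (start_day, end_day))

# Running the backfill twice yields identical aggregates -- no double counting.
backfill(conn, "2024-01-01", "2024-01-02")
backfill(conn, "2024-01-01", "2024-01-02")
totals = dict(conn.execute("SELECT day, total FROM agg_daily ORDER BY day"))
print(totals)  # {'2024-01-01': 15.0, '2024-01-02': 7.0}
```

Wrapping the delete and insert in one transaction also addresses the partial-failure hazard: a crash mid-backfill rolls back rather than leaving a mix of corrected and deleted days.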
Dimension Hierarchy Changes: Pre-aggregated rollups often depend on dimensional hierarchies: city rolls up to state rolls up to country, or Stock Keeping Unit (SKU) rolls up to category. When a product moves from electronics to home goods, historical aggregates computed under the old hierarchy conflict with current definitions, and queries that drill down from category to SKU show inconsistent totals. Some organizations accept this and treat hierarchies as "current state" only. Others maintain slowly changing dimensions (SCD), tracking when each SKU belonged to each category over time. This enables time-aware rollups but significantly complicates aggregation logic: you must join to the SCD table to determine which hierarchy applied on each historical date. Storage and query complexity both increase substantially.

High Cardinality Explosion: Pre-aggregating by high-cardinality dimensions such as user_id, session_id, or URL can explode aggregate row counts and eliminate most of the benefit. If you have 100 million active users and aggregate by user, country, and day, you might generate 7 billion aggregate rows per month, often larger than the raw event data after compression, which defeats the entire point of pre-aggregation. Production systems need heuristics that limit aggregation to bounded-cardinality dimensions or aggregate at coarser levels: for example, by user segment (1,000 segments) instead of individual user. Alternatively, use sampling: pre-aggregate a 10 percent sample by user, then scale estimates accordingly. Finally, skew matters: a single very large customer or one dominant country can consume disproportionate resources, and special-casing such hot keys with separate processing paths keeps them from overwhelming the system.
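The hierarchy-change scenario above, resolved with a time-aware SCD lookup, might look like this sketch; the SKU, categories, and data structures are hypothetical:

```python
from bisect import bisect_right
from datetime import date

# SCD rows per SKU: a sorted list of (valid_from, category) versions.
scd = {
    "SKU-123": [(date(2023, 1, 1), "electronics"),
                (date(2024, 6, 1), "home_goods")],  # reclassified mid-2024
}

def category_on(sku: str, on_date: date) -> str:
    """Return the category a SKU belonged to on a given date."""
    versions = scd[sku]
    # Find the latest version whose valid_from is on or before on_date.
    idx = bisect_right([v[0] for v in versions], on_date) - 1
    if idx < 0:
        raise KeyError(f"{sku} had no category on {on_date}")
    return versions[idx][1]

# Historical events roll up under the hierarchy in effect at event time.
events = [(date(2024, 5, 15), "SKU-123", 100.0),
          (date(2024, 7, 1), "SKU-123", 80.0)]
rollup: dict[str, float] = {}
for day, sku, amount in events:
    cat = category_on(sku, day)
    rollup[cat] = rollup.get(cat, 0.0) + amount
print(rollup)  # {'electronics': 100.0, 'home_goods': 80.0}
```

A "current state" rollup would instead attribute both events to home_goods; the SCD lookup preserves the totals as they were understood at the time, at the cost of the extra version join.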
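One way to enforce the bounded-cardinality heuristic is a pre-flight guard that estimates worst-case aggregate size before materializing a rollup. This is a rough sketch; the row budget and cardinalities are illustrative:

```python
def estimated_rows(cardinalities: dict[str, int], days: int = 30) -> int:
    """Upper bound on monthly aggregate rows: product of dimension sizes x days."""
    n = days
    for c in cardinalities.values():
        n *= c
    return n

ROW_BUDGET = 1_000_000_000  # refuse rollups estimated above ~1B rows/month

# Each user mostly maps to one country, so model the joint cardinality as 100M x 1.
by_user = {"user_x_country": 100_000_000}
by_segment = {"segment": 1_000, "country": 200}

print(estimated_rows(by_user) > ROW_BUDGET)     # True: ~3B rows -- too fine-grained
print(estimated_rows(by_segment) > ROW_BUDGET)  # False: ~6M rows -- acceptable
```

A real implementation would pull cardinalities from catalog statistics (or approximate them with HyperLogLog sketches) rather than hard-coding them, but the decision rule is the same: fall back to a coarser dimension when the estimate blows the budget.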
💡 Key Takeaways
Stale aggregates fail silently: queries return partial results with no error. Implement completeness checks comparing aggregate totals to raw fact control totals
Backfills risk double counting unless aggregation is idempotent. Use delete-then-insert or upsert patterns, and validate completeness after backfill
Dimension hierarchy changes create inconsistencies. Choose between accepting current-state hierarchies or maintaining slowly changing dimensions with time-aware rollups
High-cardinality dimensions like user_id can generate more aggregate rows than raw data. Limit aggregation to bounded-cardinality dimensions or use coarser levels like user segments
📌 Examples
1. A finance rollup misses 1 hour due to late partition arrival. Revenue is underreported by 3%, but the dashboard shows no error. Completeness checks detect the gap.
2. Backfilling 30 days of corrected metrics without deleting old aggregates first silently doubles all counts for that period.
3. Pre-aggregating by user_id for 100 million users generates 7 billion rows per month, larger than compressed raw events, eliminating the storage benefit.