Storage and Query Optimization for Event Models
The Scale Challenge:
Event models generate massive data volumes. An app with 10 million users producing 100 events per user per day creates 1 billion events daily, or roughly 365 billion events per year. At an average of 1 kilobyte (KB) per event (including metadata and context), that is about 365 terabytes (TB) of raw data annually. Companies that process events for many customers handle petabytes. Efficient storage and query strategies are not optional at this scale.
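The arithmetic behind these figures is easy to sanity-check. A quick back-of-envelope calculation (the 1 KB average event size is the assumption stated above):

```python
users = 10_000_000             # app users
events_per_user_per_day = 100
event_size_bytes = 1_000       # ~1 KB per event incl. metadata (assumed average)

events_per_day = users * events_per_user_per_day        # 1 billion
events_per_year = events_per_day * 365                  # 365 billion
raw_bytes_per_year = events_per_year * event_size_bytes

print(f"{events_per_day:,} events per day")
print(f"{raw_bytes_per_year / 1e12:,.0f} TB of raw events per year")  # ~365 TB
```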
Partitioning Strategy:
The most common pattern is to partition by event date. Events for January 15, 2024 go into one partition. Events for January 16, 2024 go into another. This optimizes time range queries, which dominate analytics workloads. A query like "show me daily active users for the last 30 days" only scans 30 partitions, not the entire dataset.
However, date-only partitioning funnels every write for the current day into a single partition and lets very active users or tenants create hot spots. The solution is compound partitioning: first by date, then by a hash of the user ID or tenant ID. For example, events for January 15, 2024 are further split into 100 buckets based on a hash of the user ID, spreading load across many physical partitions. The tradeoff is that a query for a specific user must scan all 100 buckets for each date, increasing query latency.
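A minimal sketch of how such a compound partition key might be derived. The column names, the md5-based bucketing, and the Hive-style path layout are illustrative assumptions, not a prescribed scheme:

```python
import hashlib
from datetime import datetime, timezone

NUM_BUCKETS = 100  # hash buckets per date (assumed; tune to cluster and data size)

def partition_path(event_time_ms: int, user_id: str) -> str:
    """Map an event to its date + hash-bucket partition, Hive-style layout."""
    event_date = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc).date()
    # Use a stable hash; Python's built-in hash() is salted per process.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"events/date={event_date.isoformat()}/bucket={bucket:02d}/"

print(partition_path(1705312800000, "user_456"))
# -> events/date=2024-01-15/bucket=NN/   (NN determined by the user-ID hash)
```

A time-range query touches only the matching date directories; a single-user query must probe every bucket under each date, which is the latency tradeoff described above.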
Columnar Storage Formats:
Event data is write-heavy, and reads are dominated by analytical queries that aggregate a handful of columns. Columnar formats like Parquet or Optimized Row Columnar (ORC) compress data 5 to 10 times better than row formats like JavaScript Object Notation (JSON) or comma separated values (CSV) and let query engines read only the columns a query needs. For example, a query counting events by type reads only the event_type column, not all 50 columns per event, cutting input/output (I/O) by roughly 98%.
Additionally, columnar formats support efficient encoding. Timestamps can be delta encoded (store differences from a base value). Categorical fields like event_type or platform can be dictionary encoded (store "iOS" as integer 1, "Android" as integer 2). These techniques further reduce storage by 2 to 5 times beyond basic compression.
✓ In Practice: Switching from JSON to Parquet for event storage typically reduces storage costs by 80% to 90% and improves query performance by 10 to 50 times for analytical workloads.
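A hedged sketch of the column-pruning benefit using pandas and Parquet. The file names, the sample columns, and the zstd compression choice are assumptions for illustration:

```python
import pandas as pd

# A small sample of events, written as both JSON lines and Parquet.
events = pd.DataFrame({
    "event_type": ["page_view", "click", "purchase"] * 1000,
    "user_id":    [f"user_{i % 50}" for i in range(3000)],
    "platform":   ["iOS", "Android", "web"] * 1000,
    "ts_ms":      range(3000),
})
events.to_json("events.jsonl", orient="records", lines=True)   # row format
events.to_parquet("events.parquet", compression="zstd")        # columnar + compressed

# An aggregation that needs one column reads only that column from Parquet.
counts = pd.read_parquet("events.parquet", columns=["event_type"])["event_type"] \
           .value_counts()
print(counts)
```

With a wide production schema, restricting the read to one column is where most of the I/O reduction comes from; the compression and encoding gains come on top of that.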
Tiered Storage:
Not all event data has the same access pattern. Recent events are queried frequently. Events from 6 months ago are accessed rarely. Tiered storage places hot data on fast, expensive storage like Solid State Drives (SSDs) and cold data on slow, cheap storage like object stores. A common pattern is to keep the last 7 days on SSDs for interactive queries, the last 90 days on standard object storage for batch analytics, and older data on archival storage that takes minutes to access.
This reduces costs dramatically. SSD storage might cost 10 cents per gigabyte (GB) per month, standard object storage 2 cents per GB per month, and archival storage 0.4 cents per GB per month. For a petabyte dataset kept entirely on SSDs (roughly $100,000 per month), moving 80% of it to archival storage cuts the bill by about $77,000 per month; at multi-petabyte scale the savings reach hundreds of thousands of dollars monthly.
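A small cost model makes the tradeoff concrete. The per-GB prices are the illustrative figures quoted above (real cloud pricing varies by provider and region), and the 7% / 13% / 80% split across tiers is an assumed distribution:

```python
# $/GB/month for each tier (illustrative figures from the text, not vendor quotes)
PRICE = {"ssd": 0.10, "object": 0.02, "archive": 0.004}

def monthly_cost(gb_by_tier: dict[str, float]) -> float:
    return sum(PRICE[tier] * gb for tier, gb in gb_by_tier.items())

total_gb = 1_000_000  # ~1 PB
all_ssd = monthly_cost({"ssd": total_gb})
tiered  = monthly_cost({"ssd":     0.07 * total_gb,   # roughly the last 7 days, hot
                        "object":  0.13 * total_gb,   # days 8-90, batch analytics
                        "archive": 0.80 * total_gb})  # older, rarely accessed data
print(f"all SSD: ${all_ssd:,.0f}/month")   # $100,000
print(f"tiered:  ${tiered:,.0f}/month")    # ~$12,800, roughly 90% cheaper
```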
Pre-Aggregation and Materialized Views:
Interactive analytics require sub-second query latencies on billions of rows, which is impossible with raw event scans. The solution is pre-aggregation: you materialize common metrics like daily active users, conversion rates, or average session duration as summary tables that update incrementally. A query that would scan 10 billion raw events instead reads a much smaller summary table with one row per user per day, returning results in milliseconds.
The tradeoff is freshness. Pre-aggregated views lag raw events by minutes to hours. Systems often maintain both: real-time dashboards query recent raw events with some latency, while historical reports query pre-aggregated views for speed.
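A minimal sketch of incremental pre-aggregation with pandas. The table and column names are assumptions, and production systems would typically maintain such summaries in the warehouse or a stream processor rather than in application code:

```python
import pandas as pd

def update_daily_summary(summary: pd.DataFrame, new_events: pd.DataFrame) -> pd.DataFrame:
    """Fold a new batch of raw events into a per-user-per-day summary table."""
    new_events["event_date"] = pd.to_datetime(new_events["ts_ms"], unit="ms").dt.date
    batch = (new_events
             .groupby(["user_id", "event_date"])
             .agg(event_count=("event_type", "size"),
                  purchases=("event_type", lambda s: (s == "purchase").sum()))
             .reset_index())
    # Merge with the existing summary and re-sum any overlapping (user, day) rows.
    combined = pd.concat([summary, batch], ignore_index=True)
    return (combined
            .groupby(["user_id", "event_date"], as_index=False)
            .sum(numeric_only=True))

# Dashboards then query the small summary table instead of billions of raw events, e.g.
# daily_active_users = summary.groupby("event_date")["user_id"].nunique()
```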
Indexing for Point Queries:
Some queries need to find all events for a specific user or session, not aggregate across all users. Without indexes, this requires scanning entire partitions. Solutions include secondary indexes on user ID or session ID, typically implemented as inverted indexes or bloom filters. A bloom filter can tell you with certainty that a user ID is not in a partition, allowing the query engine to skip it. This reduces scans by 90% to 99% for point queries at the cost of additional storage overhead, typically 1% to 5% of raw data size.
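A toy bloom-filter sketch showing the "definitely not here, safe to skip" property that lets query engines prune partitions. The parameter choices and the double-hashing scheme are illustrative and not tied to any particular engine:

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: may give false positives, never false negatives."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):        # double hashing to derive k positions
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# One filter per partition: if might_contain("user_456") returns False,
# the query engine can skip that partition entirely.
```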
💡 Key Takeaways
• A 10 million user app generating 100 events per user per day creates 365 TB of raw data annually at 1 KB per event. Companies at scale handle petabytes, which requires aggressive optimization.
• Partition by date first for time-range queries, then by a hash of user ID or tenant ID to avoid hot partitions. This spreads load across 100+ buckets per date but increases query latency for single-user lookups.
• Columnar formats like Parquet reduce storage by 5 to 10 times versus JSON and improve query performance by 10 to 50 times for analytical workloads by reading only the needed columns.
• Tiered storage keeps the last 7 days on SSDs ($0.10 per GB per month), days 8 to 90 on object storage ($0.02 per GB per month), and older data on archival storage ($0.004 per GB per month), cutting storage costs by roughly 90% versus keeping everything on SSDs.
• Pre-aggregated materialized views allow sub-second queries over billions of rows by maintaining summary tables (one row per user per day) that update incrementally, trading freshness for speed.
• Bloom filters and secondary indexes enable point queries for specific users, reducing partition scans by 90% to 99% at a storage overhead of 1% to 5% of raw data size.
📌 Examples
Partitioning example: An events table is partitioned as date=2024-01-15/bucket=00 through bucket=99 based on hash(user_id) mod 100. A query for the last 30 days scans 30 date partitions; a query for user_456 scans all 100 buckets for each of the 30 dates (3,000 partitions, but each far smaller than an unbucketed date partition).
Pre-aggregation example: The raw events table has 10 billion rows, so a daily-active-users query that scans all of them takes 30 seconds. The pre-aggregated user_daily_summary table has 100 million rows (10 million users × 10 days); the same query against the summary table returns in 200 milliseconds.