Real-Time OLAP Implementation Architecture
Building the System:
Understanding the implementation details shows you can actually design and build a real-time OLAP system, not just talk about it abstractly. The architecture breaks down into five core subsystems that must work together.
Ingestion Layer Design:
Ingestion tasks consume from a partitioned durable event log. Each partition is processed by one ingestion task to maintain ordering. Tasks transform records (parse JSON, apply schemas, enrich with lookups), enforce schema validation, and handle schema evolution through versioned schemas with backward compatible changes.
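To make this concrete, here is a minimal Python sketch of a per-partition ingestion loop. The `consumer`, `enrich`, and `buffer` objects and the toy `SCHEMA_REGISTRY` are assumptions for illustration, not the API of any specific engine.

```python
import json

# Hypothetical registry of versioned schemas: version -> required fields.
SCHEMA_REGISTRY = {
    1: {"event_id", "user_id", "ts"},
    2: {"event_id", "user_id", "ts", "country"},  # backward-compatible addition
}

def validate(record: dict) -> bool:
    """Accept a record only if it carries the fields its schema version requires."""
    required = SCHEMA_REGISTRY.get(record.get("schema_version", 1), set())
    return required.issubset(record.keys())

def run_ingestion_task(consumer, enrich, buffer):
    """One task owns one log partition so per-partition ordering is preserved."""
    for raw in consumer:                          # iterator over one partition of the event log
        record = json.loads(raw.value)            # parse JSON payload
        if not validate(record):                  # enforce schema validation
            continue                              # a real system would route to a dead-letter queue
        record.update(enrich(record))             # dimension lookups, enrichment
        buffer.append(record, offset=raw.offset)  # buffer into the open real-time segment
```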
Events are buffered in memory and built into real-time segments covering short time windows, typically 5 to 15 minutes. Segments use columnar encoding (separate arrays for each field), compression (dictionary encoding for low cardinality strings, run-length encoding for sorted columns, lightweight compression like LZ4 for others), and basic indexes (sorted indexes on time and primary key, inverted indexes on high selectivity filter columns).
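As a rough illustration of these columnar encodings, the sketch below dictionary-encodes a low cardinality string column and run-length-encodes the sorted codes; the layout is illustrative, not any particular engine's segment format.

```python
def dictionary_encode(values):
    """Map low-cardinality strings to small integer IDs plus a lookup dictionary."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def run_length_encode(sorted_codes):
    """RLE suits sorted columns: store (value, run_length) pairs instead of every cell."""
    runs = []
    for c in sorted_codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

countries = ["US", "US", "DE", "US", "DE", "FR"]
dictionary, codes = dictionary_encode(countries)  # {'US': 0, 'DE': 1, 'FR': 2}, [0, 0, 1, 0, 1, 2]
runs = run_length_encode(sorted(codes))           # [[0, 3], [1, 2], [2, 1]]
```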
When a segment reaches its time window boundary or size limit (for example, 100 MB to 500 MB), it is sealed, persisted to disk or object storage, and registered with the coordinator. Ingestion tasks checkpoint their log offsets so they can resume from the last committed position on restart.
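The seal-and-checkpoint decision might look roughly like the following sketch; the thresholds, the `segment` fields, and the `object_store`, `coordinator`, and `log_client` interfaces are all placeholders.

```python
import time

SEGMENT_WINDOW_SECONDS = 10 * 60          # assume a 10-minute window
SEGMENT_SIZE_LIMIT_BYTES = 300 * 1024**2  # assume a ~300 MB size cap

def maybe_seal(segment, coordinator, log_client, object_store):
    """Seal the open segment when its window closes or it hits the size cap."""
    window_closed = time.time() - segment.opened_at >= SEGMENT_WINDOW_SECONDS
    too_big = segment.size_bytes >= SEGMENT_SIZE_LIMIT_BYTES
    if not (window_closed or too_big):
        return
    path = object_store.write(segment.to_columnar_bytes())            # persist immutable segment
    coordinator.register_segment(segment.id, path)                    # make it discoverable
    log_client.commit_offset(segment.partition, segment.last_offset)  # restart resumes from here
```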
Segment and Partitioning Strategy:
Partition data by time first (for example, hourly or daily), then optionally by a high cardinality dimension like user ID hash or geographic region to distribute query load. Time partitioning enables efficient pruning: a query over the last 15 minutes scans only the segments in that range.
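Pruning itself is just an interval-overlap check against per-segment time metadata, as in this small sketch (the metadata field names are assumptions):

```python
def prune_segments(segments, query_start, query_end):
    """Keep only segments whose [min_ts, max_ts] range overlaps the query window."""
    return [
        s for s in segments
        if s["min_ts"] <= query_end and s["max_ts"] >= query_start
    ]

# A 15-minute query window therefore touches only the handful of recent segments.
```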
Use small segments (100 MB to 500 MB) for recent data to keep ingestion and eviction fast, and larger segments (1 GB to 5 GB) for historical data to reduce metadata overhead and improve scan efficiency through better compression ratios. Segments are immutable once built. Mutations like late event upserts are achieved through upsert indexes (maintaining a separate index of updates to apply during query time) or by creating replacement segments during periodic compaction.
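One way to picture the upsert-index approach: segments stay immutable, and a small per-key overlay is consulted at query time to hide stale versions. A simplified sketch, not any specific engine's mechanism:

```python
class UpsertOverlay:
    """Tracks primary key -> (segment_id, row_id) of the latest version of each row."""

    def __init__(self):
        self.latest = {}

    def record_upsert(self, key, segment_id, row_id):
        self.latest[key] = (segment_id, row_id)

    def is_visible(self, key, segment_id, row_id):
        """At query time, only the newest version of an upserted key is returned."""
        current = self.latest.get(key)
        return current is None or current == (segment_id, row_id)
```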
Scaling and Reliability Strategies:
To handle 10 times more read queries, add more servers and horizontally shard segments across them. Ensure brokers can fan out to more nodes without becoming bottlenecks, potentially using hierarchical brokers where leaf brokers handle subsets of servers and root brokers coordinate leaf brokers.
To handle 10 times more write throughput, increase event log partitions and corresponding ingestion tasks. Consider reducing per-record cost by simplifying transformations, reducing index density on hot tables (fewer inverted indexes), or batching smaller segments into larger ones less frequently.
For long-term cost efficiency, use tiered storage. Hot data (last 24 to 48 hours) lives on SSD with 3x replication for low latency and high availability. Warm data (last 30 days) sits on cheaper spinning disks with 2x replication, accepting slightly higher query latency. Cold data (older than 30 days) goes to object storage as a single copy, still queryable but with relaxed latency SLOs of seconds instead of milliseconds.
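Tiering is often driven by a small set of declarative rules; a hypothetical policy, with made-up field names and SLO values mirroring the numbers above, could look like this:

```python
# Illustrative tiering rules; field names and SLO values are assumptions.
STORAGE_TIERS = [
    {"name": "hot",  "max_age_hours": 48,      "medium": "ssd",          "replicas": 3, "latency_slo_ms": 100},
    {"name": "warm", "max_age_hours": 30 * 24, "medium": "hdd",          "replicas": 2, "latency_slo_ms": 500},
    {"name": "cold", "max_age_hours": None,    "medium": "object_store", "replicas": 1, "latency_slo_ms": 5000},
]

def tier_for(segment_age_hours):
    """Pick the first tier whose age bound the segment still satisfies."""
    for tier in STORAGE_TIERS:
        if tier["max_age_hours"] is None or segment_age_hours <= tier["max_age_hours"]:
            return tier["name"]
```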
For reliability, replicate segments across availability zones so node failures do not cause data loss. Use checkpointing in ingestion tasks so they can restart from well-defined log offsets after crashes. Track ingestion lag per partition as a critical metric and trigger alerts when lag exceeds SLO thresholds. Implement data validation and reconciliation jobs that periodically compare aggregate metrics between real-time OLAP output and the authoritative data lake or warehouse to detect drift exceeding acceptable bounds (for example, more than 1% difference). Design brokers to fail over transparently when servers go down by re-routing queries to replica segments on healthy servers.
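The reconciliation check reduces to comparing the same aggregate from both systems and flagging relative drift; a minimal sketch assuming a 1% threshold:

```python
def drift_exceeds_bound(olap_value, warehouse_value, threshold=0.01):
    """Return True if the real-time OLAP aggregate drifts more than 1% from the warehouse."""
    if warehouse_value == 0:
        return olap_value != 0
    drift = abs(olap_value - warehouse_value) / abs(warehouse_value)
    return drift > threshold  # True -> raise an alert and investigate the pipeline

# Example: compare hourly revenue per region from both systems on a schedule.
```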
✓ In Practice: At scale, a table might have 50,000 segments: 1,000 small real-time segments covering the last 6 hours and 49,000 larger historical segments covering months. Metadata management becomes a bottleneck if not designed carefully.
Indexing and Pre-Aggregation:
Build per-column dictionaries to map string values to integer IDs, enabling fast equality filters and low memory group-by operations. Create sorted indexes on time and frequently filtered columns to enable binary search. Build inverted indexes (value to list of row IDs) on high selectivity columns to accelerate filters like country = 'US'.
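An inverted index on a filter column is essentially a posting-list map from each value (or its dictionary ID) to the row IDs containing it; a minimal sketch:

```python
from collections import defaultdict

def build_inverted_index(column_values):
    """value -> ascending list of row IDs where that value appears."""
    index = defaultdict(list)
    for row_id, value in enumerate(column_values):
        index[value].append(row_id)
    return index

idx = build_inverted_index(["US", "DE", "US", "FR", "US"])
matching_rows = idx["US"]  # [0, 2, 4] -> evaluate country = 'US' without scanning every row
```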
For common query patterns, maintain pre-aggregated structures. Rollup cubes store pre-computed aggregates at multiple granularities (for example, clicks by hour, by day, by week), trading 2 to 3 times storage for 10 to 100 times faster query time for exact match queries. Specialized tree indexes like star-tree pre-aggregate combinations of dimensions, enabling sub-second response for top N queries without scanning raw rows.
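A rollup cube is the same fact pre-aggregated at several grains; this simplified sketch materializes one grain of a click-count cube (the event field names are assumptions):

```python
from collections import defaultdict
from datetime import datetime, timezone

def rollup_clicks(events, grain="hour"):
    """Pre-aggregate click counts per (time bucket, country) so an exact-match
    query reads a single row instead of scanning raw events."""
    fmt = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d"}[grain]
    cube = defaultdict(int)
    for e in events:
        bucket = datetime.fromtimestamp(e["ts"], tz=timezone.utc).strftime(fmt)
        cube[(bucket, e["country"])] += 1
    return cube  # build one cube per grain: hourly, daily, weekly
```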
Balance index overhead against write throughput. Each additional index adds 10 to 30% to segment build time and storage. Over-indexing on high write volume tables can reduce ingestion throughput from 500,000 to 100,000 events per second per ingestion task.
Serving Topology:
Brokers are stateless frontends that parse SQL queries, plan execution (determine which segments and servers to query based on filters and time ranges), route sub-queries to appropriate servers, and merge partial results (combining aggregates, sorting top N results).
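In outline, the broker's scatter-gather for an aggregate query might look like this sketch; `plan`, `routing_table`, and the server `scan_and_aggregate` call are hypothetical interfaces:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_query(plan, routing_table, servers):
    """Scatter sub-queries to the servers owning relevant segments, then merge partials."""
    by_server = {}
    for segment_id in plan.relevant_segments:          # already pruned by time and filters
        by_server.setdefault(routing_table[segment_id], []).append(segment_id)

    with ThreadPoolExecutor() as pool:                 # fan out sub-queries in parallel
        partials = list(pool.map(
            lambda item: servers[item[0]].scan_and_aggregate(item[1], plan),
            by_server.items(),
        ))

    merged = {}
    for partial in partials:                           # combine partial aggregates (e.g. SUMs)
        for group_key, value in partial.items():
            merged[group_key] = merged.get(group_key, 0) + value
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:plan.top_n]
```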
Servers host segments in memory or on local SSD, execute scan and aggregation operators using vectorized execution and SIMD instructions, maintain local result caches to avoid recomputing identical queries, and respond to broker sub-queries.
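Vectorized execution means operating on whole column arrays at once rather than row by row; a simplified NumPy sketch of a filtered aggregation over dictionary-encoded region IDs:

```python
import numpy as np

def revenue_by_region(region_ids, revenue, ts, window_start, window_end):
    """Sum revenue per region for rows inside the time window, one column at a time."""
    mask = (ts >= window_start) & (ts <= window_end)   # vectorized time filter
    # region_ids are non-negative dictionary codes, so bincount groups them in one pass.
    return np.bincount(region_ids[mask], weights=revenue[mask])
```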
A controller or coordinator acts as the control plane, tracking segment locations across servers (maintaining a routing table mapping segment ID to server list), orchestrating rebalancing when servers are added or removed (moving segments to new servers), coordinating segment lifecycle from real-time to historical tiers (triggering compaction and tier migration), and performing health checks and failure detection.
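The coordinator's core state is little more than a routing table plus rebalance logic; a minimal sketch with assumed interfaces:

```python
class Coordinator:
    """Control plane: tracks segment placement and re-homes replicas on failures."""

    def __init__(self):
        self.routing = {}  # segment_id -> list of server IDs holding a replica

    def register_segment(self, segment_id, servers):
        self.routing[segment_id] = list(servers)

    def load_of(self, server):
        """Number of segment replicas currently assigned to a server."""
        return sum(server in holders for holders in self.routing.values())

    def on_server_removed(self, dead_server, healthy_servers):
        """Move replicas off a failed server onto the least-loaded healthy servers.
        (A real system would also skip servers that already hold a replica.)"""
        for holders in self.routing.values():
            if dead_server in holders:
                holders.remove(dead_server)
                holders.append(min(healthy_servers, key=self.load_of))
```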
Scaling Real-World Systems: to absorb 10x more reads, add servers and shard segments across them.
💡 Key Takeaways
✓ Real-time segments are 100 MB to 500 MB for fast ingestion and eviction, historical segments are 1 GB to 5 GB for better compression and scan efficiency
✓ Each additional index adds 10 to 30% to segment build time and storage; over-indexing reduces ingestion throughput from 500,000 to 100,000 events per second per task
✓ Pre-aggregated rollup cubes trade 2 to 3 times storage for 10 to 100 times faster query time on exact match queries
✓ Tiered storage puts hot data (last 24 to 48 hours) on SSD with 3x replication, warm data (last 30 days) on cheaper disks with 2x replication, cold data (older) in object storage with single copy
✓ Brokers are stateless query planners that route sub-queries to servers holding relevant segments; servers execute scans using vectorized execution and SIMD; controller tracks segment locations and orchestrates rebalancing
📌 Examples
1. A table with 10 billion events per day uses 5 minute segment windows during real-time ingestion, creating 288 segments per day. After 6 hours, segments are compacted into 6 larger hourly segments. After 7 days, daily segments are created with heavy compression, reducing 168 hourly segments to 7 daily segments.
2. A query for revenue by region in the last hour hits a broker, which identifies 12 relevant segments across 8 servers. The broker sends 8 parallel sub-queries. Each server scans its segments (1 to 2 segments per server), aggregates locally, and returns partial results. The broker merges the 8 partial results in 15 milliseconds; total query time is 220 milliseconds.
3. Scaling reads: the system handles 500 queries per second at p95 300 milliseconds with 20 servers. Traffic grows to 5,000 QPS. The team adds 180 servers (10x total) and re-shards segments across the new servers. Brokers now fan out to 200 servers per query, but each server scans fewer segments. P95 latency stays at 320 milliseconds.