
Failure Modes and Edge Cases in Real-Time OLAP

Where Real-Time OLAP Breaks: Understanding failure modes is critical for interviews because it shows you have built or operated these systems at scale. Real-time OLAP architectures fail in characteristic ways that require specific mitigation strategies.

Data Freshness and Ingestion Lag: If ingestion lags behind the event log, dashboards show stale data and Service Level Objective (SLO) violations occur. A spike in input volume, a slow transformation step, or hot partitions can cause ingestion delay to grow from seconds to many minutes. Operators monitor ingestion lag per partition (the difference between the latest event timestamp in the log and the latest processed timestamp) and set alerts when lag exceeds thresholds.
Ingestion Lag Cascade (diagram): NORMAL 8 sec → TRAFFIC SPIKE 45 sec → OVERLOAD 8 min
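The per-partition lag monitoring described above can be sketched in a few lines of Python. This is a minimal illustration under assumed names, not any particular system's API: `PartitionState`, `check_partitions`, and the 45 second threshold are hypothetical, and it assumes you can read both the newest event timestamp in the log and the newest timestamp the store has processed for each partition.

```python
import time
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition's ingestion progress."""
    partition: int
    latest_event_ts: float      # newest event timestamp sitting in the log
    latest_processed_ts: float  # newest event timestamp the OLAP store has ingested


ALERT_THRESHOLD_S = 45.0  # illustrative freshness SLO; tune per table


def ingestion_lag_seconds(state: PartitionState) -> float:
    """Lag = how far ingestion trails the tail of the event log."""
    return max(0.0, state.latest_event_ts - state.latest_processed_ts)


def check_partitions(states: list[PartitionState]) -> list[tuple[int, float]]:
    """Return (partition, lag) pairs that breach the freshness threshold."""
    return [(s.partition, lag)
            for s in states
            if (lag := ingestion_lag_seconds(s)) > ALERT_THRESHOLD_S]


if __name__ == "__main__":
    now = time.time()
    partitions = [
        PartitionState(0, now, now - 8),    # healthy: ~8 seconds behind
        PartitionState(1, now, now - 480),  # hot partition: ~8 minutes behind
    ]
    for partition, lag in check_partitions(partitions):
        print(f"ALERT: partition {partition} lag {lag:.0f}s exceeds {ALERT_THRESHOLD_S:.0f}s")
```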
Mitigation includes autoscaling ingestion tasks based on lag metrics, backpressure mechanisms that slow producers when consumers cannot keep up, and partitioning hot keys across more ingestion workers. At high scale, you might add dedicated ingestion clusters for critical tables with strict freshness SLOs.

Dual Pipeline Divergence: In Lambda-style architectures, the real-time path and the batch path might compute metrics slightly differently due to code drift, different handling of edge cases, or schema evolution issues. When real-time segments age out and queries fall back to historical segments, numbers can jump: yesterday's conversion rate suddenly changes overnight without any actual business change. This is particularly insidious because it erodes trust in the system, and engineers spend days debugging phantom issues. Best practice is treating both paths as first-class citizens with shared business logic libraries, comprehensive integration tests comparing outputs, and automated reconciliation jobs that detect drift above acceptable thresholds (for example, more than 1% difference for a given metric over a given time window); a reconciliation sketch follows below.

Late, Out of Order, and Duplicate Events: Events for a given time window may arrive minutes or hours late due to mobile clients going offline, batch uploads, or clock skew. If the system only aggregates based on ingestion time, metrics will drift as late data trickles in: if you count clicks by hour using ingestion time and 10% of clicks arrive 2 hours late, your hourly metrics are permanently wrong. Strategies include event time windows (using the event timestamp rather than the processing timestamp), watermarking (defining how late you will wait for events before finalizing a window), reprocessing (allowing segments to be rebuilt when late data arrives), and upserts (updating existing aggregations when late events arrive). Each has a cost: watermarking delays results, reprocessing adds complexity, and upserts slow ingestion and increase index maintenance overhead. The choice depends on how much lateness you expect and how critical accuracy is; the windowing sketch below shows the watermark approach.
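For the dual pipeline divergence discussed above, a reconciliation job can periodically compare the same metric over the same window from both paths. The sketch below is illustrative only: `reconcile`, the metric dictionaries, and the 1% threshold are invented for the example, and a real job would pull both sides from the real-time and batch tables.

```python
DRIFT_THRESHOLD = 0.01  # flag anything above 1% relative difference


def relative_drift(realtime_value: float, batch_value: float) -> float:
    """Relative difference between the two pipelines for one metric."""
    if batch_value == 0:
        return 0.0 if realtime_value == 0 else float("inf")
    return abs(realtime_value - batch_value) / abs(batch_value)


def reconcile(metrics_realtime: dict[str, float],
              metrics_batch: dict[str, float]) -> list[str]:
    """Return human-readable findings for metrics that diverge too much."""
    findings = []
    for name, rt_value in metrics_realtime.items():
        batch_value = metrics_batch.get(name)
        if batch_value is None:
            findings.append(f"{name}: present in real-time output, missing from batch")
            continue
        drift = relative_drift(rt_value, batch_value)
        if drift > DRIFT_THRESHOLD:
            findings.append(f"{name}: {drift:.2%} drift (rt={rt_value}, batch={batch_value})")
    return findings


if __name__ == "__main__":
    realtime = {"conversion_rate": 0.0312, "orders": 10450.0}
    batch = {"conversion_rate": 0.0298, "orders": 10455.0}
    for finding in reconcile(realtime, batch):
        print("RECONCILIATION ALERT:", finding)
```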
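For the late-event strategies, the windowing sketch below shows event-time windows plus a watermark in plain Python. The 1 hour window, 2 hour allowed lateness, and the `HourlyClickCounter` name are assumptions for illustration; stream processors and OLAP stores implement this natively with more sophisticated state handling.

```python
from collections import defaultdict

WINDOW_S = 3600            # hourly windows keyed by event time
ALLOWED_LATENESS_S = 7200  # wait up to 2 hours for stragglers before finalizing


def window_start(event_ts: float) -> int:
    """Floor an event timestamp to the start of its hourly window."""
    return int(event_ts // WINDOW_S) * WINDOW_S


class HourlyClickCounter:
    """Counts clicks per event-time hour; a window is only considered final
    once the watermark (max event time seen minus allowed lateness) passes it."""

    def __init__(self) -> None:
        self.counts: dict[int, int] = defaultdict(int)
        self.max_event_ts = 0.0

    @property
    def watermark(self) -> float:
        return self.max_event_ts - ALLOWED_LATENESS_S

    def ingest(self, event_ts: float) -> None:
        self.max_event_ts = max(self.max_event_ts, event_ts)
        if event_ts < self.watermark:
            # Later than we agreed to wait for; dropped here. A real system
            # would route this to a reprocessing or upsert path instead.
            return
        self.counts[window_start(event_ts)] += 1

    def finalized_windows(self) -> dict[int, int]:
        """Windows whose end has already been passed by the watermark."""
        return {w: c for w, c in self.counts.items() if w + WINDOW_S <= self.watermark}
```

Raising ALLOWED_LATENESS_S improves accuracy but delays finalized results, which is exactly the watermarking trade-off described above.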
❗ Remember: Duplicate events can occur due to retries, at-least-once delivery semantics, or producer bugs. Without deduplication based on event IDs, you double count metrics. Deduplication adds state management and increases ingestion latency.
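A minimal idempotency sketch, assuming each event carries a unique ID: the bounded in-memory structure below stands in for whatever dedup state a real pipeline keeps (a TTL cache, a bloom filter per time bucket, or store-level upsert keys), and the class name and eviction size are invented for the example.

```python
from collections import OrderedDict


class EventDeduplicator:
    """Drops events whose IDs were already seen, using a bounded in-memory set.
    This is the state-management cost mentioned above: memory, plus a lookup
    per event on the ingestion hot path."""

    def __init__(self, max_ids: int = 1_000_000) -> None:
        self.seen: OrderedDict[str, None] = OrderedDict()
        self.max_ids = max_ids

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self.seen:
            return True
        self.seen[event_id] = None
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict the oldest ID to bound memory
        return False


if __name__ == "__main__":
    dedup = EventDeduplicator()
    for eid in ["a", "b", "a", "c", "a"]:
        print(eid, "duplicate" if dedup.is_duplicate(eid) else "new")
```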
Skew and Hot Spots: If queries or data are skewed, some nodes handle disproportionate load. Consider 90% of queries filtering on the United States region and the last 15 minutes: the partitions holding this slice become hot, CPU utilization hits 100%, and p99 latency jumps from 500 milliseconds to 5 seconds while other nodes sit idle. Mitigations include better partitioning strategies (hash by high-cardinality keys to distribute load), replication of hot segments across more nodes, pre-aggregation for common filter combinations to reduce scan cost, and query result caching at the broker level. At extreme scale, some systems maintain shadow copies of hot partitions with different sort orders optimized for different query patterns.

Node and Segment Failures: Real-time segments stored only in memory are vulnerable to process crashes or node failures. If an ingestion task crashes before persisting its in-memory segment to disk, the last few minutes of data are lost. Systems must persist to disk or replicated storage quickly (for example, every 2 to 5 minutes) and replicate segments across availability zones. On the query side, if a subset of segments is unavailable (disk failure, network partition), you must choose: fail the query with an error, return partial results with a warning, or fall back to slightly older data from backup replicas. The choice depends on your consistency requirements and user expectations: financial dashboards might fail queries rather than show partial data, while operational monitoring might prefer partial results over no results (see the policy sketch below).

Query Overload: A sudden burst of heavy ad hoc queries (multi-join scans over multiple hours of data) can starve dashboards of resources. One analyst running an exploratory query that scans 6 hours of data across 200 nodes can consume so much CPU that p95 latency for all other queries degrades from 300 milliseconds to 3 seconds. Production systems enforce query quotas (each user or team gets a maximum number of concurrent queries or maximum total CPU seconds per hour), timeouts (queries exceeding time limits are killed), and resource isolation (critical dashboards use dedicated query capacity separate from ad hoc exploration). Some systems support query prioritization, where dashboard queries get higher CPU shares than batch exploration queries; an admission-control sketch follows below.
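The fail-versus-partial decision for unavailable segments can be captured as an explicit policy at query planning time. This is a sketch under assumed names (`UnavailablePolicy`, `plan_query`, the "@replica" suffix), not any real engine's planner:

```python
from enum import Enum


class UnavailablePolicy(Enum):
    FAIL = "fail"          # e.g. financial dashboards: never show partial numbers
    PARTIAL = "partial"    # e.g. ops monitoring: some data beats no data
    FALLBACK = "fallback"  # serve slightly stale data from backup replicas


def plan_query(segments_needed: set[str],
               segments_available: set[str],
               policy: UnavailablePolicy) -> tuple[set[str], bool]:
    """Return (segments to scan, whether results need a 'partial/stale' warning)."""
    missing = segments_needed - segments_available
    if not missing:
        return segments_needed, False
    if policy is UnavailablePolicy.FAIL:
        raise RuntimeError(f"{len(missing)} segment(s) unavailable; failing query")
    if policy is UnavailablePolicy.PARTIAL:
        return segments_needed & segments_available, True
    # FALLBACK: substitute backup replicas (possibly slightly behind) for missing segments
    return (segments_needed & segments_available) | {s + "@replica" for s in missing}, True


if __name__ == "__main__":
    needed = {"seg_0900", "seg_0915", "seg_0930"}
    available = {"seg_0900", "seg_0915"}
    print(plan_query(needed, available, UnavailablePolicy.PARTIAL))
```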
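Query overload protection often starts with broker-side admission control. The sketch below assumes a simple per-user concurrency quota and a per-query deadline; the class, the limit of 3 concurrent queries, and the 30 second timeout are illustrative, and real systems layer CPU-second quotas and priority-based scheduling on top.

```python
import time
from collections import defaultdict

MAX_CONCURRENT_PER_USER = 3  # illustrative quota
QUERY_TIMEOUT_S = 30.0       # illustrative per-query time limit


class QueryAdmissionController:
    """Broker-side sketch: reject a query when the user is already at their
    concurrency quota, and hand the execution layer a deadline to enforce."""

    def __init__(self) -> None:
        self.running: dict[str, int] = defaultdict(int)

    def admit(self, user: str) -> float | None:
        """Return an absolute deadline if admitted, or None if quota is exhausted."""
        if self.running[user] >= MAX_CONCURRENT_PER_USER:
            return None
        self.running[user] += 1
        return time.monotonic() + QUERY_TIMEOUT_S

    def release(self, user: str) -> None:
        """Call when a query finishes, times out, or is killed."""
        self.running[user] = max(0, self.running[user] - 1)


if __name__ == "__main__":
    ctrl = QueryAdmissionController()
    deadlines = [ctrl.admit("analyst_1") for _ in range(4)]
    print(deadlines[:3])  # three queries admitted with deadlines
    print(deadlines[3])   # fourth rejected: None
```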
💡 Key Takeaways
Ingestion lag violations occur when spikes or hot partitions cause delay to grow from seconds to minutes; monitor lag per partition and implement autoscaling and backpressure mechanisms
Dual pipeline divergence causes metrics to jump overnight when real-time segments age out; mitigate with shared business logic, integration tests, and automated reconciliation detecting over 1% drift
Late events arriving hours after their event time cause metric drift if aggregations use ingestion time; event time windows and watermarking trade freshness for accuracy, while upserts add ingestion overhead
Hot partitions occur when 90% of queries target the same slice (like US region, last 15 minutes), causing p99 latency to jump from 500 milliseconds to 5 seconds while other nodes sit idle
Query overload from one heavy ad hoc query scanning 6 hours across 200 nodes can degrade p95 latency from 300 milliseconds to 3 seconds for all users; enforce quotas, timeouts, and resource isolation
📌 Examples
1. Production incident: Flash sale causes 3x traffic spike, ingestion lag grows from 8 seconds to 8 minutes. Dashboard shows stale conversion rates. Team adds 5 more ingestion workers, lag returns to normal in 12 minutes. Post-mortem adds autoscaling rule triggered at 45 second lag threshold.
2. Lambda architecture inconsistency: Real-time path handles a null country field by defaulting to Unknown, batch path filters out nulls entirely. When real-time segments age out at midnight, country distribution metrics shift by 3%. Engineers spend 2 days debugging before discovering the divergence in null handling logic.
3. Mobile app events: 15% of mobile events arrive 30 to 120 minutes late due to users being offline. System uses a 2 hour watermark, accepting a 2 hour delay before finalizing hourly aggregations. Alternative would be rebuilding segments when late data arrives, but reprocessing cost deemed too high for this use case.