
Multi-Tenancy and Failure Modes in Real-Time OLAP

Multi-tenancy is critical for real-time Online Analytical Processing (OLAP) at scale because a single expensive query or misbehaving tenant can degrade Service Level Agreements (SLAs) for all users. The noisy-neighbor problem appears when one tenant runs wide scans (no time filter), expensive distinct counts, or high-cardinality GROUP BYs that saturate the Central Processing Unit (CPU), exhaust memory, or fill network buffers. Without isolation, Uber observed scenarios where one analyst's exploratory query scanning 100 million rows spiked p99 latency from 200 milliseconds (ms) to 5 seconds for operational dashboards.

Implement per-tenant quotas (maximum queries per second, maximum threads per query, maximum memory per query) and query timeouts (kill queries exceeding 10 to 30 seconds). Admission control rejects or queues requests when tenant budgets are exceeded, preventing cascading failures. LinkedIn runs thousands of use cases with strict quotas: product dashboards get priority queues and higher resource limits, while exploratory analytics tolerate queueing and lower limits. Segment assignment can physically isolate critical tenants on dedicated node pools, fully separating their Input/Output (IO) and compute from shared workloads.

Hot partitions and data skew are insidious failure modes. If data is partitioned by time and a single hot dimension value (e.g., country = US) dominates, one node handles 80% of traffic while others sit idle. Symptoms include CPU saturation on specific hosts, long garbage collection pauses, and p99 latency spiking from 50 ms to 500 ms. Airbnb hit this when a high-traffic experiment variant concentrated load on one shard. Mitigate with composite partitioning (a hash of userId concatenated with a timestamp bucket) or pre-aggregation per hot key to spread load.

Cascading failures occur when one component's degradation amplifies upstream. A cache stampede is a classic example: a cache entry expires, 10,000 requests simultaneously hit the database, causing a 30-second spike and exhausting connection pools; query timeouts then trigger retries, doubling the load. A thundering herd during deployments is another: 1,000 servers restart simultaneously, overwhelming downstream dependencies like metadata stores or coordination services. Use staggered rollouts, circuit breakers (open after 5 consecutive failures, half-open to let test requests probe recovery), and stale-while-revalidate caching to absorb traffic spikes.
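The sketch below shows one way per-tenant admission control could look, assuming a simple fixed-window rate counter and asyncio-based query execution; the quota fields (max_qps, max_concurrent, timeout_s) and their limits are illustrative, not drawn from any specific engine.

```python
import asyncio
import time
from dataclasses import dataclass, field

# Hypothetical per-tenant budgets; names and limits are illustrative.
@dataclass
class TenantQuota:
    max_qps: float = 50.0       # maximum queries admitted per second
    max_concurrent: int = 8     # maximum in-flight queries
    timeout_s: float = 30.0     # kill queries exceeding this wall-clock budget

@dataclass
class TenantState:
    window_start: float = field(default_factory=time.monotonic)
    queries_in_window: int = 0
    in_flight: int = 0

class AdmissionController:
    """Rejects (or lets the caller queue) queries that exceed a tenant's budget."""

    def __init__(self, quotas: dict[str, TenantQuota]):
        self.quotas = quotas
        self.state: dict[str, TenantState] = {}

    def _admit(self, tenant: str) -> bool:
        quota = self.quotas.get(tenant, TenantQuota())
        st = self.state.setdefault(tenant, TenantState())
        now = time.monotonic()
        if now - st.window_start >= 1.0:           # reset the 1-second rate window
            st.window_start, st.queries_in_window = now, 0
        if st.queries_in_window >= quota.max_qps or st.in_flight >= quota.max_concurrent:
            return False                            # over budget: reject or queue upstream
        st.queries_in_window += 1
        st.in_flight += 1
        return True

    async def run(self, tenant: str, query_coro):
        quota = self.quotas.get(tenant, TenantQuota())
        if not self._admit(tenant):
            raise RuntimeError(f"tenant {tenant} over budget; rejecting query")
        try:
            # Enforce the per-query timeout so a wide scan cannot hold resources indefinitely.
            return await asyncio.wait_for(query_coro, timeout=quota.timeout_s)
        finally:
            self.state[tenant].in_flight -= 1
```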
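For hot-partition mitigation, a composite partition key can be as simple as hashing the user identifier together with a coarse time bucket. This is a minimal sketch assuming 20 shards and hourly buckets; the shard count, bucket size, and function name are hypothetical.

```python
import hashlib

NUM_SHARDS = 20  # illustrative shard count

def shard_for(user_id: str, event_ts_epoch_s: int, bucket_s: int = 3600) -> int:
    """Composite key: hash(userId) combined with an hourly time bucket.

    Hashing the user spreads a hot dimension value (e.g., country = US) across
    shards, while the time bucket keeps recent data together for time-range pruning.
    """
    time_bucket = event_ts_epoch_s // bucket_s
    key = f"{user_id}:{time_bucket}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

In practice the same key function would be applied on both the write path (segment assignment) and the read path (query routing) so lookups land on the shard that holds the data.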
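A stale-while-revalidate cache absorbs the stampede when an entry expires by continuing to serve the old value and refreshing it once in the background. This is a rough sketch assuming a thread-based refresh and a caller-supplied loader function; the TTL values and names are illustrative.

```python
import threading
import time

class SWRCache:
    """Serve stale entries while refreshing in the background, so one expired key
    does not send thousands of concurrent requests to the database at once."""

    def __init__(self, loader, ttl_s=60.0, stale_s=300.0):
        self.loader = loader            # function key -> value (e.g., a database query)
        self.ttl_s = ttl_s              # entry is fresh for this long
        self.stale_s = stale_s          # entry may still be served (stale) up to this age
        self._data = {}                 # key -> (value, fetched_at)
        self._refreshing = set()
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            return self._refresh(key)   # cold miss: the only path that blocks on the loader
        value, fetched_at = entry
        age = now - fetched_at
        if self.ttl_s < age <= self.stale_s:
            self._refresh_async(key)    # stale: serve the old value, refresh in background
        elif age > self.stale_s:
            return self._refresh(key)   # too old to serve: refresh synchronously
        return value

    def _refresh(self, key):
        value = self.loader(key)
        with self._lock:
            self._data[key] = (value, time.monotonic())
        return value

    def _refresh_async(self, key):
        with self._lock:
            if key in self._refreshing:  # collapse concurrent refreshes into one
                return
            self._refreshing.add(key)

        def worker():
            try:
                self._refresh(key)
            finally:
                with self._lock:
                    self._refreshing.discard(key)

        threading.Thread(target=worker, daemon=True).start()
```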
💡 Key Takeaways
Noisy neighbor problem: one tenant's expensive query (wide scan, high cardinality GROUP BY) saturates CPU and spikes p99 latency from 200 ms to 5 seconds for all users without multi-tenancy controls
Implement per-tenant quotas (max queries per second, max threads, max memory) and query timeouts (kill after 10 to 30 seconds); admission control rejects or queues requests exceeding budgets to prevent cascading failures
Hot partitions occur when data is skewed (80% of traffic for country = US hits one shard); symptoms include CPU saturation on specific hosts and p99 jumping from 50 ms to 500 ms; fix with composite partitioning such as a hash of userId plus a timestamp bucket
Cache stampede: cache expires, 10,000 requests hit the database simultaneously, causing a 30-second spike and connection pool exhaustion; mitigate with the stale-while-revalidate pattern (serve stale data while refreshing asynchronously)
Cardinality explosion: GROUP BY on multiple high cardinality dimensions (userId, productId, timestamp) produces millions of result rows, exceeding memory and network budgets; enforce result row caps and require pre-aggregated cubes for risky queries
Thundering herd during deployments: 1000 servers restart simultaneously after deploy, overwhelming metadata stores and coordination services; use staggered rollouts (10% every 5 minutes) and circuit breakers to absorb spikes
📌 Examples
LinkedIn isolation: Product dashboards ("Who viewed your profile") get dedicated segment assignments and priority queues with 2x resource limits; exploratory analytics share general pool with lower limits and tolerate queueing during peak traffic
Airbnb hot partition: High-traffic experiment variant concentrated bookings from US users on one shard, causing p99 to spike from 100 ms to 800 ms; re-partitioned using a hash of userId plus experimentId to spread load evenly across 20 shards, reducing p99 back to sub-200 ms
Uber cascading failure: Metadata service slowdown (200 ms to 2 sec response) caused query layer to exhaust connection pools; queries timed out and retried, doubling load and crashing entire cluster; added circuit breaker (open after 5 failures, 30 sec cooldown) and increased timeouts from 10 sec to 30 sec to ride through transient spikes
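A circuit breaker matching the behavior described in the Uber example (trip after 5 consecutive failures, 30-second cooldown, then a single half-open probe) could be sketched roughly as follows; the class and parameter names are illustrative, not Uber's actual implementation.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; after a cooldown, allow one half-open probe."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None           # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast instead of piling on retries")
            # Cooldown elapsed: half-open, allow this single call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        else:
            self.consecutive_failures = 0           # success closes the breaker again
            self.opened_at = None
            return result
```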