Campaign Thundering Herds and Audience Segmentation

Building a recipient list on the fly during campaign launch creates a thundering herd that can bring down the entire notification system. When a marketer clicks send on a campaign targeting 5 million users, a naive implementation queries the user database to resolve active device tokens, overwhelming the primary database with a complex join across users, devices, and preferences. This query might take 30 to 60 seconds and lock tables, causing timeouts for concurrent transactional traffic. Even if it succeeds, flooding the notification queue with 5 million messages in a burst creates downstream cascades: preference cache misses spike, rendering services saturate their Central Processing Unit (CPU) capacity, and provider rate limiters trip.

The solution is precomputed audience segments materialized into fast lookup stores. When marketers define a segment (for example, users in California who made purchases in the last 30 days), a background job computes the audience asynchronously, stores the user identifiers in a dedicated segment table or object storage, and marks the segment ready. At send time, the campaign service chunks this precomputed list into batches of 10,000 user identifiers and publishes them to the notification queue at a controlled pace (for example, 2,000 to 5,000 per second), respecting both system capacity and provider limits.

Pacing and progressive rollout prevent the thundering herd at delivery time. Instead of injecting 5 million notifications instantly, ramp over 10 to 30 minutes: start at 1,000 per second, monitor queue depth and error rates, and double throughput every 2 minutes until reaching the target rate or hitting back-pressure thresholds (for example, queue age over 10 seconds or provider error rate over 3%). If errors spike, halt the campaign automatically and alert operators. Google Firebase Cloud Messaging (FCM) and Apple Push Notification service (APNs) handle millions of notifications per tenant, but client-side pacing prevents overwhelming your own infrastructure and allows a graceful abort if the campaign targets stale tokens or triggers user complaints.

Multi-tenant fairness compounds the problem. Without per-tenant quotas, a single large tenant can monopolize workers and queue capacity, starving smaller tenants. Implement token bucket rate limiting per tenant (for example, 10,000 per minute) and a global budget (for example, 50,000 per minute across all tenants). Track concurrent sends per tenant and enforce limits; queue excess sends with back-pressure signaling to the campaign scheduler. This prevents noisy-neighbor effects where one tenant's bulk campaign degrades Service Level Objectives (SLOs) for all.
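A minimal sketch of the progressive rollout loop described above, written in Python. The `queue`, `metrics`, and `alert` clients are hypothetical stand-ins for your notification queue, monitoring, and paging systems, and the thresholds mirror the figures in this section; they are starting points, not prescriptions.

```python
import time

# Illustrative thresholds drawn from the discussion above; tune per system.
INITIAL_RATE = 1_000        # notifications per second at ramp start
TARGET_RATE = 5_000         # steady-state injection rate
RAMP_INTERVAL_S = 120       # double throughput every 2 minutes
MAX_QUEUE_AGE_S = 10        # back-pressure threshold
MAX_ERROR_RATE = 0.03       # automatic-halt threshold (3% provider errors)


def run_campaign(segment_batches, queue, metrics, alert):
    """Publish precomputed segment batches at a progressively ramped rate.

    segment_batches: iterator over lists of user identifiers (<= 10,000 each)
    queue, metrics, alert: hypothetical clients for the notification queue,
    monitoring, and operator alerting.
    """
    rate = INITIAL_RATE
    last_ramp = time.monotonic()

    for batch in segment_batches:
        # Halt automatically if provider health degrades.
        if metrics.provider_error_rate() > MAX_ERROR_RATE:
            alert("campaign halted: provider error rate above 3%")
            return

        # Back pressure: hold injection until the queue drains.
        while metrics.queue_age_seconds() > MAX_QUEUE_AGE_S:
            time.sleep(5)

        queue.publish_batch(batch)

        # Sleep long enough that this batch respects the current rate.
        time.sleep(len(batch) / rate)

        # Double the rate every RAMP_INTERVAL_S until the target is reached.
        if rate < TARGET_RATE and time.monotonic() - last_ramp >= RAMP_INTERVAL_S:
            rate = min(rate * 2, TARGET_RATE)
            last_ramp = time.monotonic()
```

Because the loop only consumes an iterator of batches, the same pacer works whether the segment lives in a database table or in object storage.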
💡 Key Takeaways
Building recipient lists on the fly during campaign send can take 30 to 60 seconds, lock database tables, and cause timeouts for transactional queries. Precompute audience segments asynchronously into materialized tables or object storage before send time.
Chunk precomputed segments into 10,000-user batches and pace injection at 2,000 to 5,000 per second. Progressive rollout ramps from 1,000 per second over 10 to 30 minutes, monitoring queue age and provider error rates before scaling.
Campaign thundering herd failure: injecting 5 million notifications instantly causes preference cache miss storms, rendering-service Central Processing Unit (CPU) saturation, and tripped provider rate limiters. Pacing spreads the load and allows an automatic abort when error rates exceed 3%.
Multi-tenant fairness requires a token bucket per tenant (10,000 per minute) and a global budget (50,000 per minute). Without quotas, a single large tenant monopolizes workers and degrades Service Level Objectives (SLOs) for smaller tenants via the noisy-neighbor effect; a token-bucket sketch follows these takeaways.
Google Firebase Cloud Messaging (FCM) and Apple Push Notification service (APNs) handle millions per tenant, but client-side pacing protects your infrastructure. Monitor provider acceptance rates: halt the campaign if the rejection rate exceeds 5% to prevent token invalidation cascades.
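A hedged sketch of the per-tenant token bucket and global budget from the fairness takeaway, using the 10,000 per minute and 50,000 per minute figures quoted above. `TokenBucket`, `admit_send`, and the module-level buckets are illustrative names, and the sketch is single-process and single-threaded; a production version would need locking or a shared store such as Redis.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Simple token bucket: `capacity` tokens refilled at `rate_per_sec`."""

    def __init__(self, capacity: float, rate_per_sec: float):
        self.capacity = capacity
        self.rate = rate_per_sec
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# Figures from this section: 10,000 per minute per tenant, 50,000 per minute globally.
GLOBAL_BUCKET = TokenBucket(capacity=50_000, rate_per_sec=50_000 / 60)
TENANT_BUCKETS = defaultdict(lambda: TokenBucket(capacity=10_000, rate_per_sec=10_000 / 60))


def admit_send(tenant_id: str, count: int) -> bool:
    """Admit `count` notifications for a tenant, or signal back pressure.

    Both the tenant bucket and the global budget must have room; otherwise
    the caller re-queues the batch and slows the campaign scheduler.
    """
    tenant_bucket = TENANT_BUCKETS[tenant_id]
    if not tenant_bucket.try_acquire(count):
        return False  # tenant over its 10,000 per minute quota
    if not GLOBAL_BUCKET.try_acquire(count):
        # Refund the tenant tokens so another tenant's budget is not consumed.
        tenant_bucket.tokens = min(tenant_bucket.capacity, tenant_bucket.tokens + count)
        return False  # global 50,000 per minute budget exhausted
    return True
```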
📌 Examples
E-commerce platform precomputes a Black Friday segment (10 million users who abandoned carts in the last 7 days) into Amazon Simple Storage Service (S3) partitioned by region. Campaign service reads 10,000-user batches and publishes to Simple Queue Service (SQS) at 5,000 per second, ramping over 20 minutes to full 50,000 per second capacity; a segment-reading sketch follows these examples.
Social media app launches viral feature announcement to 50 million users. Progressive rollout starts at 1,000 per second, doubles every 2 minutes. At 8,000 per second, Firebase Cloud Messaging (FCM) error rate hits 4%; system halts, cleans stale tokens, and resumes at 5,000 per second with 1% error rate.
Multi-tenant notification platform enforces a 10,000 per minute per-tenant limit. Large tenant tries to send 1 million notifications instantly; the system queues the excess, signals back pressure to the campaign scheduler, and spreads the send over 100 minutes to prevent starving 50 smaller tenants.
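A sketch of how the campaign service might stream a precomputed segment out of Amazon Simple Storage Service (S3) in 10,000-identifier batches, as in the Black Friday example. It assumes the segmentation job wrote one user identifier per line under an S3 prefix and uses boto3; the bucket and prefix names are placeholders. The resulting iterator can feed the paced `run_campaign` loop sketched earlier.

```python
import boto3  # assumption: the segment was materialized to S3 by the background job

BATCH_SIZE = 10_000


def segment_batches(bucket: str, prefix: str):
    """Yield lists of up to BATCH_SIZE user identifiers from a precomputed segment.

    Assumes objects under s3://{bucket}/{prefix}/ (for example, partitioned by
    region) contain one user identifier per line.
    """
    s3 = boto3.client("s3")
    batch = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                batch.append(line.decode().strip())
                if len(batch) == BATCH_SIZE:
                    yield batch
                    batch = []
    if batch:
        yield batch


# Example wiring (names hypothetical): feed the reader into the paced publisher.
# run_campaign(segment_batches("campaign-segments", "black-friday/us-west"),
#              queue, metrics, alert)
```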