Message Queues & Streaming • Notification System Design (Push Notifications) • Hard • ⏱️ ~2 min
Campaign Thundering Herds and Audience Segmentation
⚠️ Thundering Herd Anti-Pattern
Building the recipient list on the fly during campaign launch: marketer clicks send on a 5M-user campaign → naive implementation queries the user database → complex join across users, devices, and preferences → 30-60 second query locks tables → timeouts cascade into transactional traffic.
📋 Safe Campaign Flow
1. Marketer defines segment (e.g., CA users with purchases in 30 days)
2. Background job computes audience asynchronously
3. Store user IDs in segment table or object storage, mark ready
4. At send time, chunk into batches of 10K user IDs
5. Publish to queue at controlled pace (2-5K/sec)
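Steps 4 and 5 of the flow above can be sketched as a small pacing loop. This is a minimal illustration, not a production publisher: `publish_batch` is a hypothetical callback standing in for whatever enqueues one batch (e.g. a wrapper around an SQS or Kafka producer call), and the batch size and pace defaults mirror the numbers in the flow.

```python
import time

def publish_segment(user_ids, publish_batch, batch_size=10_000, pace_per_sec=5_000):
    """Chunk a precomputed segment and inject it at a controlled pace.

    `publish_batch` is a hypothetical callback that enqueues one batch of
    user IDs (e.g. a wrapper around a queue producer call).
    """
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        publish_batch(batch)
        # Sleep so the sustained rate never exceeds pace_per_sec user IDs/sec.
        time.sleep(len(batch) / pace_per_sec)
```

Because the segment is already materialized, this loop never touches the user database at send time; the only tunable knobs are the batch size and the injection rate.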
💡 Key Takeaways
✓ Building recipient lists on the fly during campaign send can take 30 to 60 seconds, lock database tables, and cause timeouts for transactional queries. Precompute audience segments asynchronously into materialized tables or object storage before send time.
✓ Chunk precomputed segments into 10,000-user batches and pace injection at 2,000 to 5,000 per second. Progressive rollout ramps from 1,000 per second over 10 to 30 minutes, monitoring queue age and provider error rates before scaling.
✓ Campaign thundering herd failure: injecting 5 million notifications instantly causes preference-cache miss storms, rendering CPU saturation, and provider rate-limiter triggers. Pacing spreads load and allows aborting when error rates exceed 3%.
✓ Multi-tenant fairness requires a token bucket per tenant (10,000 per minute) and a global budget (50,000 per minute). Without quotas, a single large tenant monopolizes workers and degrades Service Level Objectives (SLOs) for smaller tenants via the noisy-neighbor effect.
✓ Google Firebase Cloud Messaging (FCM) and Apple Push Notification service (APNs) handle millions of notifications per tenant, but client-side pacing protects your own infrastructure. Monitor provider acceptance rates: halt the campaign if the rejection rate exceeds 5% to prevent token invalidation cascades.
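The per-tenant quota plus global budget from the takeaways can be sketched with two token buckets. This is a simplified single-process illustration, assuming the 10K/min per-tenant and 50K/min global numbers above; `admit` is a hypothetical helper name, and a real system would share bucket state (e.g. in Redis) across workers.

```python
import time

class TokenBucket:
    """Minimal token bucket: refills `rate` tokens per second up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_take(self, n=1):
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Quotas from the takeaway above: 10K/min per tenant, 50K/min global.
GLOBAL_BUDGET = TokenBucket(rate=50_000 / 60, capacity=50_000)
TENANT_BUCKETS = {}

def admit(tenant_id):
    """Admit one notification only if both tenant and global buckets allow it.

    Simplification: a global token is consumed even when the tenant bucket
    rejects; a production version would reserve and roll back.
    """
    tenant = TENANT_BUCKETS.setdefault(
        tenant_id, TokenBucket(rate=10_000 / 60, capacity=10_000))
    return GLOBAL_BUDGET.try_take() and tenant.try_take()
```

The two-level check is what prevents the noisy-neighbor effect: even a tenant with remaining quota cannot exceed the shared global budget.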
📌 Interview Tips
1. An e-commerce platform precomputes its Black Friday segment (10 million users who abandoned carts in the last 7 days) into Amazon Simple Storage Service (S3), partitioned by region. The campaign service reads 10,000-user batches and publishes to Simple Queue Service (SQS) at 5,000 per second, ramping over 20 minutes to full 50,000-per-second capacity.
2. A social media app launches a viral feature announcement to 50 million users. Progressive rollout starts at 1,000 per second and doubles every 2 minutes. At 8,000 per second, the Firebase Cloud Messaging (FCM) error rate hits 4%; the system halts, cleans stale tokens, and resumes at 5,000 per second with a 1% error rate.
3. A multi-tenant notification platform enforces a 10,000-per-minute per-tenant limit. A large tenant tries to send 1 million notifications instantly; the system queues the excess, signals backpressure to the campaign scheduler, and spreads the send over 100 minutes to avoid starving 50 smaller tenants.
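The progressive-rollout behavior in the scenarios above (double the rate each interval, halt on elevated provider errors) reduces to a tiny control step. This is a hypothetical policy function sketching that logic; the function name, thresholds, and max rate are illustrative, not a standard API.

```python
def next_send_rate(current_rate, error_rate, max_rate=50_000, halt_threshold=0.05):
    """One step of a progressive-rollout controller (hypothetical policy
    mirroring the ramps above): double the send rate each interval while
    provider errors stay low; return 0 (halt) when errors exceed the threshold.
    """
    if error_rate > halt_threshold:
        return 0  # halt the campaign; clean stale tokens before resuming
    return min(current_rate * 2, max_rate)
```

Running this once per monitoring interval reproduces the 1,000 → 2,000 → 4,000 → 8,000 per-second ramp, with an automatic stop when the provider's rejection rate crosses the abort threshold.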