Push Notification System Architecture Overview

Definition
A push notification system delivers messages from your servers to users' devices—the "Your driver is arriving" alert, the "Someone liked your photo" ping, or the "Your order shipped" update.
The Core Challenge: When you tap "Order" on a food delivery app, you expect updates. But the app's backend cannot talk directly to your phone; it must go through Apple (APNs) or Google (FCM) first. These providers maintain always-on connections to billions of devices. Your job is getting the right message to them, at the right time, without overwhelming your own systems.

Why Not Just Send Directly: Imagine 10 million users and a flash sale announcement. If your server tried to call Apple or Google for each user one by one, it would take hours and probably crash. Instead, you queue notifications internally, let dedicated workers process them in parallel, and throttle the flow to match what providers can handle. This is why notification systems are built around message queues.

The Two-Hop Path: Every push notification makes two hops: first from your server to the provider (Apple or Google), then from the provider to the device. Each hop can fail independently. Your server might succeed in sending to Apple while the user's phone is off. Or Apple might reject your request because the device token expired. Understanding these two hops is key to building reliable notification systems.

Capacity Reality: For 1 million daily active users receiving 10 notifications each, you are handling roughly 100 notifications per second on average (10 million sends spread over 86,400 seconds is about 116 per second), but flash sales or breaking news can spike this 100x. Your architecture must handle both the steady flow and the sudden surge.
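The capacity estimate above is simple enough to check by hand, but writing it down makes the assumptions explicit. The following is a minimal sketch, not a sizing tool: the function names are made up for illustration, and the per-worker throughput of 100 messages per second is the figure used in the takeaways below, not a measured value.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(daily_active_users: int, notifications_per_user: int) -> float:
    """Average sends per second if traffic were spread evenly over a day."""
    return daily_active_users * notifications_per_user / SECONDS_PER_DAY


def workers_needed(peak_rate: float, per_worker_rate: float) -> int:
    """Workers required to keep up with the peak, rounded up."""
    return math.ceil(peak_rate / per_worker_rate)


avg = average_rate(1_000_000, 10)   # about 115.7 sends per second
peak = 100 * 100                    # the text's rounded 100/s average times a 100x spike
workers = workers_needed(peak, per_worker_rate=100)  # assumed 100 msg/s per worker

print(f"average: {avg:.0f}/s, peak: {peak}/s, workers at peak: {workers}")
```

Real traffic is rarely uniform, so the average mostly serves as a baseline; the peak figure is what drives queue depth and worker pool size.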
💡 Key Takeaways
Asynchronous, queue-based architecture decouples producers from delivery, allowing systems to buffer spikes and prevent blocking. At a peak of 10,000 notifications per second, you need roughly 100 workers processing 100 messages per second each.
At-least-once delivery semantics require idempotency keys to prevent duplicates. Store idempotency records keyed by tenant, user, campaign, and collapse key, with a 24 to 48 hour time to live (TTL).
Priority lane isolation is critical: separate topics and worker pools for high-priority traffic (one-time passwords, fraud alerts) and promotional traffic keep latency under 1 second for critical flows even during bulk campaigns.
Push delivery is two-phase: server to provider typically takes tens of milliseconds, but provider to device varies with radio state and operating system power management. Android Doze can defer normal-priority messages for minutes.
Storage splits into a hot cache layer for preferences (single-digit-millisecond reads via DynamoDB with a DAX cache) and a cold, append-only tracking store. For 10 million sends per day, expect roughly 2 GB of daily growth in tracking data, which works out to about 200 bytes per send record.
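Two of the takeaways above, idempotency keys and priority lanes, can be sketched in a few lines. This is an illustrative in-memory version under stated assumptions: `Notification`, `IdempotencyStore`, `lane_for`, and the set of high-priority kinds are hypothetical names, and a production system would typically keep the idempotency records in a shared store with native TTL support rather than a process-local dict.

```python
import time
from typing import Callable, Dict, NamedTuple, Tuple


class Notification(NamedTuple):
    tenant: str
    user: str
    campaign: str
    collapse_key: str
    kind: str  # hypothetical category, e.g. "otp", "fraud", "promo"
    body: str


# Assumption: which kinds count as critical is a product decision.
HIGH_PRIORITY_KINDS = {"otp", "fraud"}

IdemKey = Tuple[str, str, str, str]


def idempotency_key(n: Notification) -> IdemKey:
    """Key fields named in the text: tenant, user, campaign, collapse key."""
    return (n.tenant, n.user, n.campaign, n.collapse_key)


def lane_for(n: Notification) -> str:
    """Route critical traffic to its own lane so bulk sends cannot delay it."""
    return "high" if n.kind in HIGH_PRIORITY_KINDS else "bulk"


class IdempotencyStore:
    """In-memory stand-in for a shared store with per-record TTL."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[IdemKey, float] = {}

    def claim(self, key: IdemKey) -> bool:
        """Return True the first time a key is seen within the TTL window."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery from the queue: skip the send
        self._expires_at[key] = now + self._ttl
        return True
```

A worker would call `claim(idempotency_key(n))` just before contacting the provider and drop the message when it returns False, which is what turns at-least-once queue delivery into at most one send per key within the TTL.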
📌 Interview Tips
1. AWS reference architecture: events are published to Amazon Simple Notification Service (SNS) standard topics (up to 30,000 messages per second, depending on region) and fan out to Amazon Simple Queue Service (SQS) queues consumed by Lambda or container workers. Preferences are stored in DynamoDB behind a DAX cache (3 or more nodes recommended for production) for sub-10-millisecond reads at 2,000 to 5,000 events per second.
2. Apple APNs uses HTTP/2 with multiplexed streams and token-based authentication. Collapse identifiers coalesce updates (for example, cart count badges) so only the latest value reaches the device, reducing notification fatigue.
3. Google FCM provides high and normal priority levels. High priority wakes the device immediately, but apps that misuse it get deprioritized by Google. Normal priority respects Doze mode and can defer delivery by minutes to hours depending on device state.
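To make the collapse and priority settings from the tips above concrete, here is a minimal sketch of how they might appear in a request to each provider. It only builds dictionaries and makes no network calls. The header and field names follow the providers' documented HTTP APIs as I understand them; the function names, bundle ID, and device token are hypothetical, and authentication headers are omitted.

```python
from typing import Dict

APNS_COLLAPSE_ID_MAX_BYTES = 64  # size limit on apns-collapse-id


def apns_headers(bundle_id: str, collapse_id: str, urgent: bool) -> Dict[str, str]:
    """Headers for an APNs HTTP/2 request (auth header not shown)."""
    if len(collapse_id.encode("utf-8")) > APNS_COLLAPSE_ID_MAX_BYTES:
        raise ValueError("apns-collapse-id must be at most 64 bytes")
    return {
        "apns-topic": bundle_id,
        "apns-push-type": "alert",
        # 10 asks for immediate delivery; 5 lets the device save power.
        "apns-priority": "10" if urgent else "5",
        # Later pushes with the same ID replace earlier ones.
        "apns-collapse-id": collapse_id,
    }


def fcm_message(token: str, collapse_key: str, urgent: bool,
                data: Dict[str, str]) -> Dict[str, object]:
    """Body for an FCM HTTP v1 send request targeting an Android device."""
    return {
        "message": {
            "token": token,
            "data": data,  # FCM data values must be strings
            "android": {
                "priority": "HIGH" if urgent else "NORMAL",
                "collapse_key": collapse_key,
            },
        }
    }


# Hypothetical cart badge update: not urgent, and only the latest count matters.
headers = apns_headers("com.example.shop", "cart-count", urgent=False)
body = fcm_message("device-token-placeholder", "cart-count", urgent=False,
                   data={"cart_count": "3"})
```

Reserving the urgent setting for flows such as one-time passwords keeps routine updates like badge counts on the power-friendly path, which is the behavior the providers reward.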