Message Queues & Streaming • Notification System Design (Push Notifications)
Device Token Lifecycle and Provider Failure Modes
Device tokens and registration identifiers are not permanent. Apple Push Notification service (APNs) tokens rotate when users reinstall apps, restore devices, or, in rare cases, after operating system updates. Firebase Cloud Messaging (FCM) registration tokens can be invalidated when apps are uninstalled, users clear data, or Google detects suspicious activity. Failing to handle the token lifecycle creates multiple failure modes: wasted capacity sending to dead tokens, unnecessary provider costs, and the risk of throttling when error rates spike.
When APNs or FCM rejects a notification with a token invalidation error (APNs returns 410 Unregistered, FCM returns NotRegistered or InvalidRegistration), you must immediately mark that token as inactive in your database and stop sending to it. If your system continues sending to 10% invalid tokens at 10,000 notifications per second, you waste 1,000 sends per second of capacity and accrue unnecessary provider charges. Worse, APNs will throttle clients that persistently send to invalid tokens, degrading delivery for valid users.
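A minimal sketch of routing these rejections is shown below. The function name, the `cleanup_queue` list, and the response shape are illustrative assumptions, not a real provider SDK; the error-code strings come from the text above.

```python
from typing import Optional

# Invalid-token error codes from the providers (per the text above):
APNS_INVALID = {"Unregistered"}                          # APNs 410 reason
FCM_INVALID = {"NotRegistered", "InvalidRegistration"}   # FCM error codes

def handle_send_result(token: str, provider: str, status: int,
                       error: Optional[str], cleanup_queue: list) -> None:
    """Route invalid-token errors to the cleanup queue; never retry them."""
    if provider == "apns" and status == 410 and error in APNS_INVALID:
        cleanup_queue.append(token)
    elif provider == "fcm" and error in FCM_INVALID:
        cleanup_queue.append(token)
    # Other errors (throttling, transient 5xx) belong on a retry path instead.
```

The key design point is the split: invalid tokens go to cleanup, never to retry, so a dead token is touched at most once more after its first rejection.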
Implement token feedback processing as a dedicated pipeline. Workers should parse provider responses, extract invalidated tokens, and publish them to a token cleanup queue. A separate service batch-processes these invalidations against your token database, marking tokens inactive and optionally logging them for audit. For scale, add a time-to-live (TTL) based approach: tokens not refreshed by the client app within 90 days are automatically marked stale and excluded from targeting, reducing the invalidation backlog.
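The TTL filter and the batch invalidation step can be sketched as follows, assuming a 90-day window as above. The in-memory `tokens` dict stands in for a real token table; field names are hypothetical.

```python
TTL_SECONDS = 90 * 24 * 3600  # tokens not refreshed within 90 days are stale

def active_targets(tokens: dict, now: float) -> list:
    """tokens maps token -> {'last_seen': timestamp, 'active': bool}.
    Returns only tokens that are active and refreshed within the TTL."""
    return [t for t, meta in tokens.items()
            if meta["active"] and now - meta["last_seen"] <= TTL_SECONDS]

def apply_invalidations(tokens: dict, cleanup_batch: list) -> None:
    """Batch-mark invalidated tokens inactive instead of row-by-row updates."""
    for t in cleanup_batch:
        if t in tokens:
            tokens[t]["active"] = False
```

Because the TTL check runs at targeting time, stale tokens are excluded even before the provider ever rejects them, which keeps the cleanup queue small.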
Provider throttling is the most insidious failure mode. If you suddenly send a bulk campaign to a segment with 30% stale tokens, APNs may see your error rate spike from a 1% baseline to 30% and begin throttling your traffic with exponential backoff or temporary bans. This cascades to impact even high-priority notifications. The mitigation is preemptive token hygiene: before large campaigns, validate token freshness by checking last-seen timestamps, and use progressive rollout (start at 1,000 per second, ramp to 10,000 per second over 10 minutes) to detect and halt on elevated error rates before saturating provider capacity.
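A minimal sketch of the progressive rollout described above: ramp from 1,000 to 10,000 sends per second over 10 minutes, halting if the observed error rate crosses a threshold (the 5% cutoff from the takeaways below). `batches` and `error_rate_fn` are hypothetical stand-ins for the real dispatch path and metrics feed.

```python
def rollout_plan(start_rps=1_000, target_rps=10_000, ramp_minutes=10):
    """Yield the per-second send rate for each minute of the ramp."""
    step = (target_rps - start_rps) / ramp_minutes
    for minute in range(ramp_minutes + 1):
        yield int(start_rps + step * minute)

def run_campaign(batches, error_rate_fn, halt_threshold=0.05):
    """Send batches at ramping rates; halt when the error rate exceeds
    the threshold, protecting provider reputation mid-campaign."""
    sent = 0
    for rate, batch in zip(rollout_plan(), batches):
        sent += len(batch)            # dispatch batch at `rate` per second
        if error_rate_fn() > halt_threshold:
            return ("halted", sent)
    return ("completed", sent)
```

Checking the error rate after every step means a bad segment is caught within the first minute, while only a small fraction of the campaign has been sent.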
💡 Key Takeaways
• Device tokens rotate on app reinstall, device restore, or uninstall. Sending to 10% invalid tokens at 10,000 per second wastes 1,000 sends per second of capacity and risks provider throttling when error rates spike from 1% to 10% or higher.
• Apple Push Notification service (APNs) returns 410 Unregistered for invalid tokens; Firebase Cloud Messaging (FCM) returns NotRegistered or InvalidRegistration. Immediately mark these inactive and exclude them from future sends to avoid cascading throttling.
• Implement a token feedback pipeline: workers parse provider responses, publish invalidations to a cleanup queue, and a batch processor marks tokens inactive in the database. Use a 90-day time-to-live (TTL) to auto-expire tokens not refreshed by client apps.
• Provider throttling failure mode: a bulk campaign with 30% stale tokens can spike the error rate and trigger exponential backoff or temporary bans, degrading even high-priority traffic. Mitigate with progressive rollout ramping from 1,000 to 10,000 per second over 10 minutes.
• Token hygiene checks before campaigns: query last-seen timestamps, exclude tokens inactive over 90 days, and run test sends to small cohorts (1,000 users) to measure error rates before the full blast. Halt if the error rate exceeds 5% to protect provider reputation.
📌 Examples
A Google Firebase Cloud Messaging (FCM) client detects NotRegistered errors at an 8% rate during a holiday campaign. The system halts the send after 50,000 notifications, cleans invalid tokens via batch update, and resumes at a reduced rate after the error rate drops to 2%.
Apple APNs throttles a client sending 100,000 notifications per minute with a 15% invalid-token rate. The client implements 90-day token expiry, the error rate drops to 2%, and APNs lifts the throttle after 24 hours of clean traffic.
A ride-sharing app processes its token feedback queue at 500 invalidations per second during peak, batch-updating its DynamoDB token table every 5 seconds (2,500 tokens per batch) to avoid overwhelming the database with single-row updates.