Device Token Lifecycle and Provider Failure Modes
Token Volatility:
Device tokens are not permanent. APNs tokens can change on app reinstall, device restore, or OS update. FCM tokens are invalidated on uninstall, app data clear, or when Google detects suspicious activity. Failing to handle this lifecycle wastes send capacity, raises provider costs, and risks throttling.
Token Feedback Pipeline:
Run a dedicated pipeline: send workers parse provider responses, extract invalidated tokens, and publish them to a cleanup queue. A separate service batch-processes that queue against the token database. Add a TTL as a backstop: tokens not refreshed within 90 days are automatically marked stale and excluded from targeting.
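The TTL backstop described above can be sketched as a targeting filter. This is a minimal illustration, not a reference implementation: the `TokenRecord` shape and the `select_targetable` helper are hypothetical names, and only the 90-day threshold comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# 90-day TTL taken from the text; everything else here is an assumption.
STALE_AFTER = timedelta(days=90)


@dataclass
class TokenRecord:
    """Hypothetical row in the token database."""
    token: str
    last_refreshed: datetime  # last time the client app re-registered the token
    active: bool = True       # set to False by the feedback pipeline


def is_targetable(record: TokenRecord, now: datetime) -> bool:
    """A token is targetable if it is active and was refreshed within the TTL."""
    return record.active and (now - record.last_refreshed) <= STALE_AFTER


def select_targetable(records: list[TokenRecord], now: datetime) -> list[str]:
    """Return only the tokens that should be included in a send."""
    return [r.token for r in records if is_targetable(r, now)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    records = [
        TokenRecord("fresh", now - timedelta(days=10)),
        TokenRecord("stale", now - timedelta(days=120)),
        TokenRecord("invalidated", now - timedelta(days=2), active=False),
    ]
    print(select_targetable(records, now))
```

Filtering at targeting time means a stale token never reaches the provider, so it cannot contribute to the error rate even if the feedback pipeline has not seen it yet.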
Provider Throttling:
The insidious failure mode: a bulk campaign targets a segment with 30% stale tokens, APNs sees the error rate spike from 1% to 30%, and that triggers exponential backoff or temporary bans that degrade even high-priority notifications.
Mitigation:
Practice preemptive token hygiene: before large campaigns, check token freshness against last-seen timestamps. Use a progressive rollout (start at 1K/sec, ramp to 10K/sec over 10 minutes) so you can detect elevated error rates and halt before saturating provider capacity.
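One way to picture the progressive rollout is as a rate schedule plus a halt check. The sketch below assumes a linear ramp (the text gives only the endpoints and the duration), a 5% halt threshold, and a minimum sample size before the check applies; `target_rate` and `should_halt` are illustrative names.

```python
def target_rate(elapsed_s: float,
                start_rate: float = 1_000,
                end_rate: float = 10_000,
                ramp_s: float = 600) -> float:
    """Allowed sends per second at `elapsed_s` seconds into the campaign.

    Assumes a linear ramp from start_rate to end_rate over ramp_s seconds.
    """
    if elapsed_s <= 0:
        return float(start_rate)
    if elapsed_s >= ramp_s:
        return float(end_rate)
    return start_rate + (end_rate - start_rate) * elapsed_s / ramp_s


def should_halt(sent: int, failed: int,
                max_error_rate: float = 0.05,
                min_sample: int = 1_000) -> bool:
    """Halt once enough sends have been observed and the error rate is too high.

    min_sample is an assumed guard so a handful of early failures
    does not stop the whole campaign.
    """
    if sent < min_sample:
        return False
    return failed / sent > max_error_rate


if __name__ == "__main__":
    for minute in (0, 2, 5, 10):
        print(minute, "min ->", target_rate(minute * 60), "per second")
    print("halt at 30% errors:", should_halt(sent=10_000, failed=3_000))
```

Starting slow matters because the first minute of traffic is effectively a measurement: if the error rate is already high at 1K/sec, the campaign can stop before the provider sees it at 10K/sec.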
⚠️ Token Invalidation Cascade
APNs returns HTTP 410 with reason Unregistered; FCM returns NotRegistered (UNREGISTERED in the HTTP v1 API). Mark the token inactive immediately. If you keep sending to 10% invalid tokens at 10K/sec, you waste 1K/sec of capacity. Worse, APNs can throttle clients with high error rates, degrading delivery for valid users.
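A rough sketch of the classification step follows. The `(provider, status, reason)` shape is a simplified assumption about how a send worker might normalize provider responses, not either provider's actual response format; the error strings are the ones named in this section, plus `UNREGISTERED` from the FCM HTTP v1 API.

```python
# Error strings that mean "this token will never work again".
# NotRegistered / InvalidRegistration are the names used in this section;
# UNREGISTERED is the FCM HTTP v1 equivalent.
FCM_INVALID_REASONS = {"NotRegistered", "InvalidRegistration", "UNREGISTERED"}


def is_token_invalid(provider: str, status: int, reason: str) -> bool:
    """Return True if the response says the token should be marked inactive.

    `provider`, `status`, and `reason` are a hypothetical normalized view
    of the provider response built by the send worker.
    """
    if provider == "apns":
        return status == 410 and reason == "Unregistered"
    if provider == "fcm":
        return reason in FCM_INVALID_REASONS
    return False


def wasted_sends_per_second(send_rate: float, invalid_fraction: float) -> float:
    """Capacity spent on tokens that cannot receive anything."""
    return send_rate * invalid_fraction


if __name__ == "__main__":
    print(is_token_invalid("apns", 410, "Unregistered"))   # token is gone
    print(is_token_invalid("apns", 429, "TooManyRequests"))  # retryable, keep token
    print(wasted_sends_per_second(10_000, 0.10))
```

Keeping the "permanently invalid" check separate from retryable errors is the important design choice: a throttling or server error should trigger a retry, never a token deletion.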
💡 Key Takeaways
✓ Device tokens change or become invalid on app reinstall, device restore, or uninstall. Sending to 10% invalid tokens at 10,000 per second wastes 1,000 sends per second of capacity and risks provider throttling when error rates spike from 1% to 10% or higher.
✓ Apple Push Notification service (APNs) returns 410 Unregistered for invalid tokens; Firebase Cloud Messaging (FCM) returns NotRegistered or InvalidRegistration (UNREGISTERED in the HTTP v1 API). Immediately mark these tokens inactive and exclude them from future sends to avoid cascading throttling.
✓ Implement a token feedback pipeline: workers parse provider responses and publish invalidations to a cleanup queue, and a batch processor marks the tokens inactive in the database. Use a 90-day time to live (TTL) to auto-expire tokens not refreshed by client apps.
✓ Provider throttling failure mode: a bulk campaign with 30% stale tokens can spike the error rate and trigger exponential backoff or temporary bans, degrading even high-priority traffic. Mitigate with a progressive rollout that ramps from 1,000 to 10,000 per second over 10 minutes.
✓ Run token hygiene checks before campaigns: query last-seen timestamps, exclude tokens inactive for over 90 days, and run test sends to small cohorts (1,000 users) to measure error rates before the full send. Halt if the error rate exceeds 5% to protect provider reputation.
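The pre-campaign test send from the last takeaway could look roughly like this. It is a sketch under assumptions: `send_fn` is a hypothetical callable that returns `True` when the provider accepts a token, and uniform random sampling is one reasonable way to pick the cohort; the 1,000-user cohort and 5% threshold come from the text.

```python
import random
from typing import Callable, Optional, Sequence


def canary_error_rate(tokens: Sequence[str],
                      send_fn: Callable[[str], bool],
                      cohort_size: int = 1_000,
                      rng: Optional[random.Random] = None) -> float:
    """Send to a random cohort and return the fraction of failed sends.

    send_fn is a stand-in for the real provider call; it should return
    True on success and False on an invalid-token style failure.
    """
    rng = rng or random.Random()
    cohort = rng.sample(list(tokens), min(cohort_size, len(tokens)))
    if not cohort:
        return 0.0
    failures = sum(1 for token in cohort if not send_fn(token))
    return failures / len(cohort)


def safe_to_launch(error_rate: float, max_error_rate: float = 0.05) -> bool:
    """Go/no-go decision: proceed only if the canary stayed at or under 5%."""
    return error_rate <= max_error_rate


if __name__ == "__main__":
    segment = [f"token-{i}" for i in range(50_000)]
    # Fake provider for the demo: every 12th token is treated as invalid.
    fake_send = lambda token: int(token.split("-")[1]) % 12 != 0
    rate = canary_error_rate(segment, fake_send)
    print(f"canary error rate: {rate:.1%}, launch: {safe_to_launch(rate)}")
```

A random cohort matters here: sampling only recently active users would understate the error rate of the full segment.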
📌 Interview Tips
1. An FCM client sees NotRegistered errors at an 8% rate during a holiday campaign. The system halts the send after 50,000 notifications, cleans invalid tokens via a batch update, and resumes at a reduced rate once the error rate drops to 2%.
2. APNs throttles a client sending 100,000 notifications per minute with a 15% invalid token rate. The client implements 90-day token expiry, the error rate drops to 2%, and APNs lifts the throttle after 24 hours of clean traffic.
3. A ride-sharing app processes its token feedback queue at 500 invalidations per second during peak, batch-updating a DynamoDB token table every 5 seconds (2,500 tokens per batch) to avoid overwhelming the database with single-row updates.
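The batching in the third example can be sketched as a small time-based buffer. This is an illustration only: `InvalidationBatcher` and `flush_fn` are hypothetical names, the database write is left to the caller, and the injectable clock exists so the behavior can be exercised without waiting.

```python
import time
from typing import Callable, List


class InvalidationBatcher:
    """Buffer invalidated tokens and hand them off in one batch per interval."""

    def __init__(self,
                 flush_fn: Callable[[List[str]], None],
                 interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush_fn = flush_fn      # e.g. a function that updates the token table
        self._interval_s = interval_s  # 5 seconds, as in the example above
        self._clock = clock
        self._buffer: List[str] = []
        self._last_flush = clock()

    def add(self, token: str) -> None:
        """Queue one invalidated token; flush if the interval has elapsed."""
        self._buffer.append(token)
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered tokens to flush_fn and reset the timer."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)
        self._last_flush = self._clock()


if __name__ == "__main__":
    batcher = InvalidationBatcher(lambda batch: print(f"flushing {len(batch)} tokens"),
                                  interval_s=0.1)
    for i in range(1_000):
        batcher.add(f"token-{i}")
        time.sleep(0.0005)
    batcher.flush()  # drain whatever is left at shutdown
```

At 500 invalidations per second, a 5-second interval yields batches of about 2,500 tokens. Because this sketch only checks the clock inside `add`, a production version would likely also flush from a timer so a quiet queue does not hold tokens indefinitely.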