Rate Limiting › Rate Limit Strategies (Per-User, Per-IP, Global) · Medium · ⏱️ ~3 min

Global Rate Limiting: Service-Wide Capacity Protection

Global rate limiting enforces a service-wide or cluster-wide cap on total request throughput, regardless of which users or IPs generate the load. When aggregate traffic across all nodes reaches the limit, additional requests receive 429 responses until capacity frees up. AWS API Gateway, for example, enforces regional account-level limits of roughly 10,000 requests per second steady state with burst capacity in the thousands, acting as a safety valve that keeps any single customer from exhausting shared infrastructure.

The primary purpose is preventing cascading failures during traffic anomalies. If a viral social media post drives 50x normal traffic, or a buggy mobile app loops retries, per-user and per-IP limits may not trigger fast enough because the load spreads across many principals. A global cap stops the stampede before it overloads databases or downstream services, buying time for teams to respond with targeted fixes such as blocking specific endpoints or deploying emergency capacity.

The danger is self-denial of service. A global limit misconfigured too low starves legitimate traffic during normal peak hours, and unlike per-user limits that only affect individual abusers, it affects everyone simultaneously. Production systems mitigate this with priority classes: critical health checks and payment-processing requests bypass the global limiter or draw from a reserved quota, while bulk analytics and backfill jobs hit the limit first. Traffic shaping ensures the most valuable requests always get through.

Implementation faces a distributed-coordination challenge. Local per-node limits (e.g., each of 100 nodes allows 100 requests per second for a 10,000 total) are fast but permit overage when load is unevenly distributed. Centralized token buckets in Redis or a coordination service guarantee accuracy but add 0.5 to 2 milliseconds of latency per request plus a single-point-of-failure risk. Most systems therefore choose approximate distributed algorithms: nodes track local usage and periodically sync with a coordinator that adjusts per-node quotas based on cluster-wide consumption, achieving 95% to 98% accuracy with microsecond local decisions.
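The node-local side of that approximate scheme can be sketched as a token bucket whose refill rate a coordinator rebalances every few seconds. This is a minimal illustration, not any specific product's implementation; the class name, the `resync` hook, and the numbers (100 nodes × 100 rps) are assumptions following the example above.

```python
import time

class NodeLocalLimiter:
    """Per-node token bucket whose quota a coordinator can rebalance.

    Hypothetical sketch: the hot-path decision is purely local (no
    network call); a background sync periodically calls resync() with
    a new per-node quota based on cluster-wide usage.
    """

    def __init__(self, quota_per_sec: float):
        self.quota = quota_per_sec          # current per-node allowance
        self.tokens = quota_per_sec         # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        """Microsecond local decision: refill by elapsed time, then spend."""
        now = time.monotonic()
        self.tokens = min(self.quota,
                          self.tokens + (now - self.last_refill) * self.quota)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                        # caller responds with HTTP 429

    def resync(self, new_quota: float) -> None:
        """Called every few seconds when the coordinator redistributes quotas."""
        self.quota = new_quota

# 100 nodes x 100 rps each = 10,000 rps cluster-wide, as in the text
limiter = NodeLocalLimiter(quota_per_sec=100)
print(limiter.allow())  # True while the bucket has tokens
```

The trade-off is visible in the code: `allow()` never blocks on the network, so a node can briefly exceed its fair share between syncs, which is where the 95% to 98% accuracy figure comes from.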
💡 Key Takeaways
AWS API Gateway enforces regional account-level limits around 10,000 requests per second steady state with burst capacity of several thousand, protecting shared infrastructure from any single tenant
Global limits prevent cascading failures when per-user and per-IP limits fail to stop distributed traffic spikes like viral events or buggy client retry loops across thousands of devices
Self-denial-of-service risk means a misconfigured global limit starves all legitimate users simultaneously; mitigate with priority classes that exempt critical traffic like health checks and payments
Centralized enforcement guarantees accuracy but adds 0.5 to 2 milliseconds per request plus availability risk; distributed approximation achieves 95% to 98% accuracy with local microsecond decisions
Implementation often uses hierarchical budgets: allocate tokens per region or per availability zone, then subdivide to nodes, allowing rebalancing without global coordination on every request
Reserve 10% to 20% of global capacity for retries and background jobs to maintain responsiveness during partial outages when retry rates spike
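The hierarchical-budget idea above can be sketched as a recursive even split of a token budget down the hierarchy. This is an illustration under simplifying assumptions: region and zone names are made up, and a real system would weight shares by observed load rather than splitting evenly.

```python
def subdivide(budget: int, children: list[str]) -> dict[str, int]:
    """Split a token budget across child scopes (hypothetical sketch).

    Remainder tokens go to the first children so the total is preserved
    exactly; rebalancing then only needs coordination within a level,
    not on every request.
    """
    base, extra = divmod(budget, len(children))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(children)}

# Hierarchy: global -> regions -> availability zones (names illustrative)
regions = subdivide(10_000, ["us-east-1", "eu-west-1", "ap-south-1"])
zones = subdivide(regions["us-east-1"], ["az-a", "az-b", "az-c"])
```

Because each level's shares sum exactly to its parent's budget, rebalancing between two zones never requires touching the global total.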
📌 Examples
A distributed global limiter might allocate each of 100 nodes 100 requests per second locally (10,000 total), then every 5 seconds sync actual usage to a coordinator that redistributes quotas to nodes experiencing higher load
Priority queues can be implemented by reserving 2,000 requests per second of a 10,000 total for Tier 1 traffic (payments, auth) and allowing Tier 2 traffic (analytics, reports) to use only the remaining 8,000
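The tiered split in that example can be sketched as a counting window with a reserved slice: Tier 2 is capped below the total so Tier 1 always has headroom. The class name and window-reset mechanism are assumptions for illustration; the 2,000-of-10,000 reserve follows the text.

```python
class TieredGlobalLimiter:
    """Sketch of priority classes over one global per-second budget.

    Tier 1 (payments, auth) may consume the whole budget, including
    its reserved slice; Tier 2 (analytics, reports) is capped at the
    remainder, so bulk load can never starve critical traffic.
    """

    def __init__(self, total: int = 10_000, reserved_tier1: int = 2_000):
        self.total = total
        self.reserved = reserved_tier1
        self.used_tier1 = 0
        self.used_tier2 = 0

    def allow(self, tier: int) -> bool:
        used = self.used_tier1 + self.used_tier2
        if tier == 1:
            if used < self.total:           # Tier 1: full budget available
                self.used_tier1 += 1
                return True
        else:
            # Tier 2: capped at total minus the Tier 1 reserve
            if self.used_tier2 < self.total - self.reserved and used < self.total:
                self.used_tier2 += 1
                return True
        return False                        # reject with HTTP 429

    def reset_window(self) -> None:
        """Called once per second to start a fresh counting window."""
        self.used_tier1 = self.used_tier2 = 0
```

Note the asymmetry: the reserve is not a hard partition. If Tier 1 is quiet, its slice simply goes unused for that window; if Tier 2 is quiet, Tier 1 can spend the entire 10,000.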