Layered Rate Limiting Strategy: Combining Multiple Scopes
Production systems stack per-user, per-IP, and global limits in layers to catch different attack vectors and failure modes. Each layer targets a specific threat: per-user limits enforce fairness between authenticated tenants, per-IP limits stop anonymous DDoS and credential stuffing, and global limits act as a final safety net when upstream anomalies bypass the first two layers. The key is setting thresholds so each layer activates only for its intended scenario without creating false positives.
Stripe exemplifies this approach by combining per-account limits with per-endpoint caps and adaptive global backstops across multiple regions. A single customer might have a 10,000 requests per hour account quota, but payment creation endpoints within that account have tighter 100 requests per minute caps due to higher backend cost. Simultaneously, Stripe enforces per-IP limits on unauthenticated endpoints and maintains regional global caps to prevent any traffic pattern from overwhelming their payment processing infrastructure. This multi-scope defense means an attacker must evade all layers simultaneously.
The implementation challenge is orchestrating these checks without adding excessive latency. Most systems evaluate limits in sequence: first the fastest local limit (per-IP, in memory), then per-user (a cached counter lookup, 0.5 to 2 milliseconds), and only if those pass do they proceed to global checks or weighted-cost calculations. Short-circuit on the first denial to minimize work. Instrument each layer separately so you can track which limits trigger most often and tune thresholds independently. For example, if per-IP limits rarely trigger but per-user limits fire constantly for specific endpoints, you may need weighted costs rather than flat per-request limits.
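Here is a minimal sketch of this sequential, short-circuiting evaluation, assuming a Redis-like cache client that exposes atomic `incr` and `expire`; the class names, thresholds, and the fixed one-minute window are illustrative assumptions, not any specific vendor's API:

```python
import time
from collections import defaultdict


class TokenBucket:
    """Simple in-memory token bucket, used here for the per-IP layer."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class LayeredLimiter:
    def __init__(self, cache):
        self.cache = cache            # Redis-like client (assumption)
        self.ip_buckets = defaultdict(lambda: TokenBucket(rate=50, capacity=100))
        self.denials = defaultdict(int)   # per-layer instrumentation

    def check(self, ip: str, user_id: str) -> bool:
        window = int(time.time()) // 60   # fixed 1-minute window, for brevity

        # Layer 1: per-IP, local memory (microseconds). Cheapest check first.
        if not self.ip_buckets[ip].allow():
            self.denials["per_ip"] += 1
            return False                  # short-circuit on first denial

        # Layer 2: per-user counter in the shared cache (~0.5-2 ms).
        user_key = f"user:{user_id}:{window}"
        user_count = self.cache.incr(user_key)
        self.cache.expire(user_key, 120)
        if user_count > 1_000:            # illustrative: 1,000 req/min per user
            self.denials["per_user"] += 1
            return False

        # Layer 3: global counter, reached only if layers 1-2 pass (~1-3 ms).
        global_key = f"global:{window}"
        global_count = self.cache.incr(global_key)
        self.cache.expire(global_key, 120)
        if global_count > 500_000:        # illustrative global cap
            self.denials["global"] += 1
            return False
        return True
```

Checking the local per-IP bucket first means the common abuse case never touches the shared cache, and the per-layer denial counters provide the separate instrumentation described above.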
Edge cases require careful handling. During a cache outage, falling back to local in-memory buckets can cause temporary per-user limit violations (each node enforces independently) but prevents a complete service outage. Hot-key problems arise when a single celebrity user creates massive counter contention; shard their key across multiple cache nodes or pre-allocate dedicated resources. Regional imbalances occur when one geographic region generates a surge: if your global limit lacks per-region budgets, Europe might starve Asia. Allocate regional quotas (e.g., 40% North America, 30% Europe, 30% Asia) and rebalance periodically based on actual traffic patterns.
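A hedged sketch of the hot-key mitigation, splitting one user's counter across several cache keys so increments spread over multiple cache nodes; the shard count and limit here are assumptions for illustration:

```python
import random

NUM_SHARDS = 8    # illustrative shard count for a known-hot key


def allow_hot_user(cache, user_id: str, window: int, total_limit: int) -> bool:
    # Pick a shard uniformly at random so increments spread across
    # NUM_SHARDS distinct cache keys (and thus, typically, cache nodes).
    shard = random.randrange(NUM_SHARDS)
    key = f"user:{user_id}:{window}:shard{shard}"
    count = cache.incr(key)
    cache.expire(key, 120)
    # Each shard enforces its share of the total budget. Shards fill
    # unevenly, so admission is approximate rather than exact.
    return count <= total_limit // NUM_SHARDS
```

Because shards fill unevenly, per-shard enforcement can admit slightly more or fewer requests than the exact limit; that approximation is the usual price of sharded counters.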
💡 Key Takeaways
• Stripe layers per-account quotas with per-endpoint caps and global regional limits; a customer with a 10,000 requests per hour account-wide quota might be restricted to 100 requests per minute on payment creation due to backend cost
• Evaluate limits in sequence from fastest to slowest: per-IP in memory (microseconds), per-user in cache (0.5 to 2 milliseconds), global distributed (1 to 3 milliseconds), and short-circuit on the first denial
• Hot-key contention occurs when a celebrity user generates millions of requests; shard their counter across multiple cache nodes or pre-allocate dedicated resources to avoid cache hotspots affecting other users
• Regional imbalances without per-region budgets let one geography starve others during surges; allocate quotas per region (e.g., 40% North America, 30% Europe, 30% Asia) and rebalance every 10 to 60 seconds
• Cache-outage fallback to local in-memory buckets causes temporary over-limit violations (each node enforces independently) but maintains availability; accept 2x to 5x overage during incidents over a full outage (see the fallback sketch after this list)
• Instrument each layer separately to track which limits fire most frequently; if per-user limits trigger constantly on specific endpoints, implement weighted costs rather than raising flat rate limits
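A sketch of the cache-outage fallback referenced above, again assuming a Redis-like client; the exception you catch should be your client's connection error type, and the local buckets are the same in-memory token buckets from the earlier sketch:

```python
import time


def check_user_limit(cache, local_buckets, user_id: str, limit: int) -> bool:
    window = int(time.time()) // 60
    key = f"user:{user_id}:{window}"
    try:
        count = cache.incr(key)
        cache.expire(key, 120)
        return count <= limit
    except Exception:   # substitute your cache client's connection error
        # Fail over to a local in-memory bucket: each of N nodes now
        # enforces the limit independently, so expect up to N x overage,
        # but the service stays available.
        return local_buckets[user_id].allow()
```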
📌 Examples
Twitter historically combined per-user limits over 15-minute windows with per-application limits, isolating abusive apps without penalizing the entire user base and preventing a single compromised app from exhausting user quotas
GitHub's GraphQL API uses point-based budgets where each query costs points based on complexity, allowing 5,000 points per hour; a simple user lookup costs 1 point while a deep repository traversal costs 50, aligning limits with actual compute cost
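A small sketch of a point-based budget in the same spirit, once more assuming a Redis-like client with atomic `incrby`; the cost table and budget are illustrative, not GitHub's actual cost model:

```python
import time

# Illustrative cost table; real systems derive cost from query complexity.
QUERY_COSTS = {"user_lookup": 1, "repo_traversal": 50}


def allow_weighted(cache, user_id: str, query_type: str,
                   budget: int = 5000) -> bool:
    hour = int(time.time()) // 3600
    key = f"points:{user_id}:{hour}"
    cost = QUERY_COSTS.get(query_type, 1)
    # Deduct the query's cost atomically; deny once the hourly
    # point budget is exhausted.
    spent = cache.incrby(key, cost)
    cache.expire(key, 7200)
    return spent <= budget
```

Charging by cost rather than request count is what lets a single limit cover both cheap lookups and expensive traversals without separate thresholds per endpoint.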