Layered Rate Limiting Strategy: Combining Multiple Scopes
Production systems stack per-user, per-IP, and global limits in layers to catch different attack vectors and failure modes. Each layer targets a specific threat: per-user limits enforce fairness between authenticated tenants, per-IP limits stop anonymous DDoS and credential stuffing, and global limits act as a final safety net when upstream anomalies bypass the first two layers. The key is setting thresholds so each layer activates only for its intended scenario without creating false positives.
Stripe exemplifies this approach by combining per-account limits with per-endpoint caps and adaptive global backstops across multiple regions. A single customer might have a 10,000 requests per hour account quota, but payment creation endpoints within that account have tighter 100 requests per minute caps due to higher backend cost. Simultaneously, Stripe enforces per-IP limits on unauthenticated endpoints and maintains regional global caps to prevent any traffic pattern from overwhelming their payment processing infrastructure. This multi-scope defense means an attacker must evade all layers simultaneously.
The implementation challenge is orchestrating these checks without adding excessive latency. Most systems evaluate limits in sequence: first the fastest local limit (per-IP, in memory), then per-user (a cached counter lookup, 0.5 to 2 milliseconds), and only if those pass do they proceed to global checks or weighted-cost calculations. Short-circuit on the first denial to minimize work. Instrument each layer separately so you can track which limits trigger most often and tune thresholds independently. For example, if per-IP limits rarely trigger but per-user limits fire constantly for specific endpoints, you may need weighted costs rather than flat per-request limits.
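Here is a minimal sketch of this sequential, short-circuiting evaluation, assuming a Redis-like cache client that exposes atomic `incr` and `expire`; the class names, thresholds, and the fixed one-minute window are illustrative assumptions, not any specific vendor's API:

```python
import time
from collections import defaultdict


class TokenBucket:
    """Simple in-memory token bucket, used here for the per-IP layer."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class LayeredLimiter:
    def __init__(self, cache):
        self.cache = cache            # Redis-like client (assumption)
        self.ip_buckets = defaultdict(lambda: TokenBucket(rate=50, capacity=100))
        self.denials = defaultdict(int)   # per-layer instrumentation

    def check(self, ip: str, user_id: str) -> bool:
        window = int(time.time()) // 60   # fixed 1-minute window, for brevity

        # Layer 1: per-IP, local memory (microseconds). Cheapest check first.
        if not self.ip_buckets[ip].allow():
            self.denials["per_ip"] += 1
            return False                  # short-circuit on first denial

        # Layer 2: per-user counter in the shared cache (~0.5-2 ms).
        user_key = f"user:{user_id}:{window}"
        user_count = self.cache.incr(user_key)
        self.cache.expire(user_key, 120)
        if user_count > 1_000:            # illustrative: 1,000 req/min per user
            self.denials["per_user"] += 1
            return False

        # Layer 3: global counter, reached only if layers 1-2 pass (~1-3 ms).
        global_key = f"global:{window}"
        global_count = self.cache.incr(global_key)
        self.cache.expire(global_key, 120)
        if global_count > 500_000:        # illustrative global cap
            self.denials["global"] += 1
            return False
        return True
```

Checking the local per-IP bucket first means the common abuse case never touches the shared cache, and the per-layer denial counters provide the separate instrumentation described above.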
Edge cases require careful handling. During a cache outage, falling back to local in-memory buckets can cause temporary per-user limit violations (each node enforces independently) but prevents a complete service outage. Hot-key problems arise when a single celebrity user creates massive counter contention; shard their key across multiple cache nodes or pre-allocate dedicated resources. Regional imbalances occur when one geographic region generates a surge: if your global limit lacks per-region budgets, Europe might starve Asia. Allocate regional quotas (e.g., 40% North America, 30% Europe, 30% Asia) and rebalance periodically based on actual traffic patterns.
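A hedged sketch of the hot-key mitigation, splitting one user's counter across several cache keys so increments spread over multiple cache nodes; the shard count and limit here are assumptions for illustration:

```python
import random

NUM_SHARDS = 8    # illustrative shard count for a known-hot key


def allow_hot_user(cache, user_id: str, window: int, total_limit: int) -> bool:
    # Pick a shard uniformly at random so increments spread across
    # NUM_SHARDS distinct cache keys (and thus, typically, cache nodes).
    shard = random.randrange(NUM_SHARDS)
    key = f"user:{user_id}:{window}:shard{shard}"
    count = cache.incr(key)
    cache.expire(key, 120)
    # Each shard enforces its share of the total budget. Shards fill
    # unevenly, so admission is approximate rather than exact.
    return count <= total_limit // NUM_SHARDS
```

Because shards fill unevenly, per-shard enforcement can admit slightly more or fewer requests than the exact limit; that approximation is the usual price of sharded counters.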
💡 Key Takeaways
• Stripe layers per-account quotas with per-endpoint caps and global regional limits; a customer with a 10,000 requests per hour account-wide quota might be restricted to 100 requests per minute on payment creation due to backend cost
• Evaluate limits in sequence from fastest to slowest: per-IP in memory (microseconds), per-user in cache (0.5 to 2 milliseconds), global distributed (1 to 3 milliseconds), and short-circuit on the first denial
• Hot-key contention occurs when a celebrity user generates millions of requests; shard their counter across multiple cache nodes or pre-allocate dedicated resources to avoid cache hotspots affecting other users
• Regional imbalances without per-region budgets let one geography starve others during surges; allocate quotas per region (e.g., 40% North America, 30% Europe, 30% Asia) and rebalance every 10 to 60 seconds
• Cache-outage fallback to local in-memory buckets causes temporary over-limit violations (each node enforces independently) but maintains availability; accept 2x to 5x overage during incidents over a full outage (see the fallback sketch after this list)
• Instrument each layer separately to track which limits fire most frequently; if per-user limits trigger constantly on specific endpoints, implement weighted costs rather than raising flat rate limits
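A sketch of the cache-outage fallback referenced above, again assuming a Redis-like client; the exception you catch should be your client's connection error type, and the local buckets are the same in-memory token buckets from the earlier sketch:

```python
import time


def check_user_limit(cache, local_buckets, user_id: str, limit: int) -> bool:
    window = int(time.time()) // 60
    key = f"user:{user_id}:{window}"
    try:
        count = cache.incr(key)
        cache.expire(key, 120)
        return count <= limit
    except Exception:   # substitute your cache client's connection error
        # Fail over to a local in-memory bucket: each of N nodes now
        # enforces the limit independently, so expect up to N x overage,
        # but the service stays available.
        return local_buckets[user_id].allow()
```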
📌 Examples
Twitter historically combined per-user limits over 15-minute windows with per-application limits, isolating abusive apps without penalizing the entire user base and preventing a single compromised app from exhausting user quotas
GitHub's GraphQL API uses point-based budgets where each query costs points based on complexity, allowing 5,000 points per hour; a simple user lookup costs 1 point while a deep repository traversal costs 50, aligning limits with actual compute cost
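A small sketch of a point-based budget in the same spirit, once more assuming a Redis-like client with atomic `incrby`; the cost table and budget are illustrative, not GitHub's actual cost model:

```python
import time

# Illustrative cost table; real systems derive cost from query complexity.
QUERY_COSTS = {"user_lookup": 1, "repo_traversal": 50}


def allow_weighted(cache, user_id: str, query_type: str,
                   budget: int = 5000) -> bool:
    hour = int(time.time()) // 3600
    key = f"points:{user_id}:{hour}"
    cost = QUERY_COSTS.get(query_type, 1)
    # Deduct the query's cost atomically; deny once the hourly
    # point budget is exhausted.
    spent = cache.incrby(key, cost)
    cache.expire(key, 7200)
    return spent <= budget
```

Charging by cost rather than request count is what lets a single limit cover both cheap lookups and expensive traversals without separate thresholds per endpoint.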