
Scaling Policies and Observability: Lag Metrics, Autoscaling, and SLOs

Lag as the Primary Scaling Signal

Effective autoscaling for consumer groups requires lag-based metrics, not just CPU or memory. A consumer at 20% CPU may be fully saturated if it is processing only 2 of 10 assigned partitions due to skew, with 8 partitions piling up lag. CPU and memory metrics miss this entirely. The primary signal is per-partition lag: the difference between the latest produced offset and the last committed consumer offset. Lag growth rate is even more important: stable lag (e.g., consistently 10,000 records) indicates the consumer is keeping up; growing lag (10,000 → 20,000 → 40,000 over time) indicates the consumer is falling behind and scaling is required.
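A minimal sketch of both signals, assuming offsets arrive from your broker client (the class and method names here are illustrative, not a real client API):

```python
from collections import deque

def partition_lag(latest_offset: int, committed_offset: int) -> int:
    """Lag = latest produced offset minus last committed consumer offset."""
    return latest_offset - committed_offset

class LagTracker:
    """Keeps recent lag samples for one partition and estimates growth rate."""
    def __init__(self, window: int = 5):
        self.samples = deque(maxlen=window)  # (timestamp_sec, lag) pairs

    def record(self, timestamp_sec: float, lag: int) -> None:
        self.samples.append((timestamp_sec, lag))

    def growth_rate(self) -> float:
        """Records/sec of lag growth across the window.
        Positive and sustained means the consumer is falling behind."""
        if len(self.samples) < 2:
            return 0.0
        (t0, l0), (t1, l1) = self.samples[0], self.samples[-1]
        return (l1 - l0) / (t1 - t0)
```

Feeding the tracker the example series above (10,000 then 20,000 ten seconds later) yields a growth rate of 1,000 records/sec, which is the actionable number; the absolute lag alone cannot distinguish "stable backlog" from "falling behind."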

Capacity Planning Math

Calculate required consumers from throughput numbers. If producers emit 100,000 records/sec and each consumer processes 2,000 records/sec, you need 50 consumers to keep up. Add headroom for traffic spikes: at 1.5x headroom, provision 75 consumers. But remember the hard cap: if you only have 50 partitions, scaling beyond 50 consumers leaves extras idle. Partition count must be provisioned for peak parallelism needs. Autoscaling down is also important: over-provisioned consumers waste money. Scale down when lag stays at zero and processing rate is well below capacity.

Key Metrics to Monitor

Track these metrics for healthy consumer groups:
Per-partition lag and lag growth rate: the primary health indicator.
Consumer throughput (records/sec): compare against expected capacity.
Fetch latency: time to retrieve records from the broker; spikes indicate broker issues.
Processing latency: time to handle each record; growth indicates downstream dependency slowdowns.
Rebalance frequency: more than 1-2 rebalances per hour indicates configuration problems.
Commit failure rate: failed commits leave larger replay windows.
Correlate consumer metrics with downstream dependency latencies to identify when external services are throttling your consumers.
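A sketch of how these metrics might feed an alert evaluation; the specific thresholds are assumptions to tune per workload, except the 1-2 rebalances/hour figure from the list above:

```python
def consumer_group_alerts(metrics: dict) -> list[str]:
    """Evaluate a snapshot of consumer-group metrics against thresholds.

    Thresholds are illustrative placeholders, not recommended defaults.
    """
    alerts = []
    if metrics["lag_growth_rate"] > 0:
        alerts.append("lag growing: consumer falling behind")
    if metrics["rebalances_per_hour"] > 2:
        alerts.append("frequent rebalances: check session/poll configuration")
    if metrics["commit_failure_rate"] > 0.01:  # assumed 1% tolerance
        alerts.append("commit failures: replay window expanding")
    return alerts
```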

Setting SLOs on Lag

Define SLOs (Service Level Objectives) on time-to-process relative to retention. If retention is 7 days and business requires data freshness within 1 hour, your SLO might be: "P99 lag less than 10 minutes; alert when any partition exceeds 30 minutes; page when any partition exceeds 1 hour." This creates actionable tiers. Monitor the violation rate: if you breach the P99 SLO more than 5% of the time, you need more capacity or partition rebalancing. Track lag by partition to identify hot partitions requiring key sharding.
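The tiered response from the example SLO can be encoded directly (a sketch; the tier names are illustrative):

```python
def lag_severity(lag_minutes: float) -> str:
    """Map per-partition lag (in minutes behind) to a response tier,
    using the example SLO: P99 < 10 min; alert > 30 min; page > 60 min."""
    if lag_minutes > 60:
        return "page"
    if lag_minutes > 30:
        return "alert"
    if lag_minutes > 10:
        return "slo-violation"
    return "ok"
```

Running this per partition, rather than on aggregate lag, is what surfaces the hot partitions the text mentions: a group-wide average can look healthy while one skewed partition is already in the "page" tier.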

Autoscaling Implementation

Implement autoscaling with hysteresis (delay between consecutive scale events) to avoid thrashing. Example policy: scale up when lag growth rate exceeds 1,000 records/sec for 5 minutes; scale down when lag is zero and throughput is below 50% capacity for 15 minutes. The asymmetry (quick scale-up, slow scale-down) favors availability over cost efficiency. Never scale above partition count; include a ceiling check in autoscaler logic.
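The policy above can be sketched as a small state machine; the thresholds (1,000 records/sec, 5 and 15 minutes, 50% utilization) come from the example policy, while the one-consumer-at-a-time step size is an assumption:

```python
class Autoscaler:
    """Hysteresis sketch: quick scale-up, slow scale-down, partition ceiling."""

    def __init__(self, partition_count: int,
                 up_sustain_sec: float = 300,     # 5 min before scaling up
                 down_sustain_sec: float = 900):  # 15 min before scaling down
        self.partition_count = partition_count
        self.up_sustain = up_sustain_sec
        self.down_sustain = down_sustain_sec
        self.up_since = None
        self.down_since = None

    def decide(self, now: float, lag_growth: float, lag: int,
               utilization: float, current: int) -> int:
        """Return the new consumer count (may equal current)."""
        if lag_growth > 1_000:
            # Scale-up condition: start (or continue) the 5-minute clock.
            self.up_since = self.up_since if self.up_since is not None else now
            self.down_since = None
            if now - self.up_since >= self.up_sustain:
                self.up_since = None
                return min(current + 1, self.partition_count)  # ceiling check
        elif lag == 0 and utilization < 0.5:
            # Scale-down condition: slower clock favors availability over cost.
            self.down_since = self.down_since if self.down_since is not None else now
            self.up_since = None
            if now - self.down_since >= self.down_sustain:
                self.down_since = None
                return max(current - 1, 1)
        else:
            self.up_since = self.down_since = None  # condition broken; reset
        return current
```

Note the asymmetric timers implement the hysteresis, and the `min(..., partition_count)` clamp implements the ceiling check, so the autoscaler can never provision idle consumers.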

Key Insight: Lag and lag growth rate are the essential signals for consumer group health, not CPU or memory. Set SLOs on time-to-process relative to retention, and implement autoscaling with hysteresis to avoid thrashing while respecting the partition count ceiling.
💡 Key Takeaways
Lag-based scaling: CPU/memory miss partition skew; per-partition lag and lag growth rate are the primary signals
Capacity math: 100K records/sec ÷ 2K/sec per consumer = 50 consumers needed; cannot exceed partition count ceiling
Key metrics: lag/growth rate, consumer throughput, fetch/processing latency, rebalance frequency, commit failure rate
SLOs: define based on time-to-process relative to retention; example: P99 lag < 10 min, alert at 30 min, page at 1 hour
📌 Interview Tips
1. Explain lag vs CPU: a consumer at 20% CPU seems fine, but 8 of 10 partitions have growing lag because 2 partitions are hot; CPU misses this
2. Capacity calculation: 100K/sec input, 2K/sec per consumer = 50 needed; with 40 partitions the ceiling is 40 consumers; add partitions first
3. Autoscaling policy: scale up when lag growth > 1K/sec for 5 min; scale down when lag = 0 and throughput < 50% for 15 min; quick up, slow down