Health Check Layers: Liveness, Readiness, and Capacity Signals
Liveness Checks
Liveness checks answer one question: is the process running at all? They should be narrow and conservative, checking only for fatal wedged states such as a hung event loop or a deadlocked thread pool. A typical liveness check verifies an internal watchdog tick or thread pool progress, nothing more. Critically, liveness checks must never verify external dependencies. If the database is down and liveness fails as a result, the orchestrator restarts every application instance, which does nothing to fix the database and makes recovery harder. Liveness failures should restart only truly broken processes that cannot recover on their own.
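The watchdog idea above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `Watchdog` class, the `max_silence_s` parameter, and `liveness_status` are hypothetical names, and it assumes the main loop calls `tick()` on every iteration.

```python
import threading
import time


class Watchdog:
    """Tracks the last time the main work loop reported progress."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self._max_silence_s = max_silence_s
        self._clock = clock  # injectable so the check can be tested without sleeping
        self._lock = threading.Lock()
        self._last_tick = clock()

    def tick(self) -> None:
        # Called by the event loop or worker pool on every iteration.
        with self._lock:
            self._last_tick = self._clock()

    def is_alive(self) -> bool:
        # Purely internal state: no database, cache, or network calls here.
        with self._lock:
            return (self._clock() - self._last_tick) <= self._max_silence_s


def liveness_status(watchdog: Watchdog) -> int:
    # 200 while the loop keeps ticking; 503 only when the process looks wedged.
    return 200 if watchdog.is_alive() else 503
```

Because the check reads only an in-process timestamp, a database outage cannot make it fail, which is the property the section asks for.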
Readiness Checks
Readiness checks determine whether this instance can serve production traffic at the expected QoS (Quality of Service) right now. Load balancers use this signal to include or exclude instances from rotation. Readiness should reflect dependency availability, queue depth, and tail latency, and the check should return HTTP 503 when the instance is temporarily unable to meet its SLOs (Service Level Objectives). Load balancers commonly probe every 5-30 seconds and mark a target unhealthy after 2-5 consecutive failures. With a 10-second interval and a failure threshold of 2, detection takes roughly 20-25 seconds in the worst case: up to two full intervals plus any probe timeout.
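As a rough sketch of both ideas in this section, the following Python combines the three readiness inputs into a status code and estimates the detection window. The thresholds (`max_queue_depth`, `max_p99_ms`), the `ReadinessInputs` type, and the timing model are assumptions for illustration; the model assumes probes fire on a fixed interval and that a failing probe takes `probe_timeout_s` to report.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ReadinessInputs:
    dependencies_ok: bool   # e.g. database and cache reachable
    queue_depth: int        # requests currently waiting
    p99_latency_ms: float   # recent tail latency


def readiness_status(inputs: ReadinessInputs,
                     max_queue_depth: int = 100,
                     max_p99_ms: float = 500.0) -> int:
    """Return 200 when the instance can meet its SLO, else 503."""
    ready = (inputs.dependencies_ok
             and inputs.queue_depth <= max_queue_depth
             and inputs.p99_latency_ms <= max_p99_ms)
    return 200 if ready else 503


def detection_window_s(interval_s: float, failure_threshold: int,
                       probe_timeout_s: float = 0.0) -> Tuple[float, float]:
    """Best- and worst-case seconds from failure to being marked unhealthy."""
    # Best case: the instance fails just before a probe fires.
    best = (failure_threshold - 1) * interval_s + probe_timeout_s
    # Worst case: the instance fails just after a probe succeeded.
    worst = failure_threshold * interval_s + probe_timeout_s
    return best, worst
```

Under this model, `detection_window_s(10, 2, 5)` gives a window of 15 to 25 seconds, in line with the worst-case estimate above.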
Capacity Signals
Capacity signals communicate how much traffic an instance should receive, rather than a binary on/off status. Agent checks let applications advertise dynamic weights from 0 to 100% or maximum connection limits. During partial degradation, an instance can signal 75% weight instead of removing itself entirely. This avoids the binary flapping problem, where instances oscillate between fully in and fully out of rotation under load pressure. Gradual weight reduction under stress generally maintains higher aggregate throughput than abrupt removal, because the degraded instance keeps serving the share of traffic it can still handle.
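One way to derive such a weight is a linear ramp between a soft and a hard utilization limit. The sketch below is illustrative only: `capacity_weight`, its limits, and the `min_weight` floor are hypothetical, and the `"75%\n"` reply format is modeled on the percentage strings that HAProxy's agent check accepts, so treat it as an assumption for any other load balancer.

```python
def capacity_weight(cpu_utilization: float,
                    soft_limit: float = 0.70,
                    hard_limit: float = 0.95,
                    min_weight: int = 10) -> int:
    """Map utilization (0.0-1.0) to an advertised weight percentage."""
    if cpu_utilization <= soft_limit:
        return 100  # plenty of headroom: take a full share of traffic
    if cpu_utilization >= hard_limit:
        return min_weight  # stay in rotation, but at a trickle
    # Linear ramp from 100 down to min_weight between the two limits.
    fraction = (cpu_utilization - soft_limit) / (hard_limit - soft_limit)
    return round(100 - fraction * (100 - min_weight))


def agent_reply(weight: int) -> str:
    # Percentage string followed by a newline (assumed agent-check format).
    return f"{weight}%\n"
```

Keeping a non-zero floor means the instance sheds load gradually instead of dropping out of rotation, which is what dampens the in/out oscillation described above.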
HTTP Response Codes
Return HTTP 200 when healthy and HTTP 503 (Service Unavailable) when temporarily unavailable. A 503 tells the load balancer to send traffic elsewhere; because the load balancer keeps probing the instance, it returns to rotation once the check passes again. Returning 200 during a degraded state prevents automatic traffic shifting and violates the health check contract. Some systems also use HTTP 429 (Too Many Requests) to signal capacity limits without indicating unhealthiness.
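A minimal endpoint following this contract might look like the sketch below, built on Python's standard `http.server`. The state names, the `/ready` path, the `CURRENT_STATE` holder, and `make_server` are all hypothetical, and whether a given load balancer treats 429 differently from 503 is system-specific.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical state names; the 429 mapping is optional and system-specific.
STATUS_BY_STATE = {
    "healthy": 200,
    "unavailable": 503,
    "saturated": 429,
}

# A real service would update this from its own readiness logic.
CURRENT_STATE = {"value": "healthy"}


def status_for(state: str) -> int:
    # Unknown states fail closed so a bug never reports 200 by accident.
    return STATUS_BY_STATE.get(state, 503)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_error(404)
            return
        state = CURRENT_STATE["value"]
        body = state.encode("utf-8")
        self.send_response(status_for(state))
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-probe access logging.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    # Binds to localhost only; call .serve_forever() on the result to run it.
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Setting `CURRENT_STATE["value"] = "unavailable"` would make the next probe receive a 503 without restarting the process, so the instance can recover and start returning 200 again.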