Guardrail Failure Modes and Mitigation Strategies

Core Concept
Guardrail failures occur when the system either misses real harm (false negatives) or blocks good experiments (false positives). Both are costly: misses ship harm to users, while false alarms delay good features and erode trust in the guardrail system.

False Negatives: Missing Real Harm

The guardrail doesn't fire when the treatment actually harms users. Causes: threshold too loose, metric not sensitive enough, detection delay too long (harm compounds before it is caught), wrong metric (measuring a proxy instead of the true outcome). Cost: user harm ships to production.

Mitigation: tighten thresholds, add more sensitive metrics, shorten detection windows, validate guardrails against known-bad experiments retrospectively.
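The "threshold too loose" and "metric not sensitive enough" causes, plus retrospective validation, can be made concrete with a small sketch. It assumes a higher-is-better metric, a guardrail that fires when the observed relative drop exceeds a fixed threshold, and, for the sensitivity check, a two-sided z-test with equal arm sizes and a known standard deviation. All function names, numbers, and the list of past experiments are hypothetical.

```python
from statistics import NormalDist


def relative_degradation(control_mean: float, treatment_mean: float) -> float:
    """Fractional drop of treatment vs control for a higher-is-better metric."""
    return (control_mean - treatment_mean) / control_mean


def threshold_guardrail_fires(control_mean: float, treatment_mean: float,
                              threshold: float) -> bool:
    """Hypothetical guardrail: fire only when the observed relative drop exceeds the threshold."""
    return relative_degradation(control_mean, treatment_mean) > threshold


def minimum_detectable_effect(sigma: float, n_per_arm: int,
                              alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest absolute difference a two-sided z-test detects with the given power
    (assumes equal arms and a known per-user standard deviation sigma)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sigma * (2 / n_per_arm) ** 0.5


def recall_on_known_bad(known_bad_degradations: list[float], threshold: float) -> float:
    """Retrospective check: fraction of past known-bad experiments the threshold would catch."""
    caught = sum(1 for d in known_bad_degradations if d > threshold)
    return caught / len(known_bad_degradations)


# Threshold too loose: a 5% drop passes a 10% threshold.
print(threshold_guardrail_fires(100.0, 95.0, threshold=0.10))  # False: harm is missed
print(threshold_guardrail_fires(100.0, 95.0, threshold=0.03))  # True

# Metric not sensitive enough: at 1,000 users per arm only large drops are reliably detected.
print(round(minimum_detectable_effect(sigma=50.0, n_per_arm=1_000), 2))   # 6.26
print(round(minimum_detectable_effect(sigma=50.0, n_per_arm=10_000), 2))  # 1.98

# Replay hypothetical observed drops from past experiments known to be harmful.
past_bad = [0.04, 0.06, 0.12, 0.08, 0.03]
print(recall_on_known_bad(past_bad, 0.10))  # 0.2
print(recall_on_known_bad(past_bad, 0.03))  # 0.8
```

With these assumed numbers, the smallest drop detected with 80% power at 1,000 users per arm is about 6.3 units (6.3% of a mean of 100), so a real 5% degradation can slip through even when the threshold is tight; ten times the traffic brings that down to about 2 units.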

False Positives: Blocking Good Experiments

The guardrail fires when the treatment is actually fine. Causes: threshold too tight, high metric variance, multiple testing without correction, outliers in small samples. Cost: good features are delayed or abandoned, and teams lose trust in the guardrail system.
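Multiple testing is the easiest of these causes to quantify. The sketch below assumes independent guardrail metrics for the family-wise error formula, and uses the Holm-Bonferroni step-down correction as one standard option; the source doesn't name a specific correction, and the metric names and p-values are illustrative.

```python
def family_wise_error_rate(alpha: float, num_metrics: int) -> float:
    """Chance that at least one of num_metrics independent guardrails fires on a harmless treatment."""
    return 1 - (1 - alpha) ** num_metrics


def holm_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> set[str]:
    """Metrics whose guardrail still fires after Holm's step-down correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    fired = set()
    for rank, (metric, p) in enumerate(ordered):
        # The k-th smallest p-value (0-indexed rank) is compared against alpha / (m - rank).
        if p > alpha / (m - rank):
            break  # first non-rejection: the remaining, larger p-values do not fire either
        fired.add(metric)
    return fired


# 20 uncorrected guardrails at alpha = 0.05 fire on a harmless change most of the time.
print(round(family_wise_error_rate(0.05, 20), 3))  # 0.642

p_values = {  # hypothetical guardrail p-values for one experiment
    "latency_p99": 0.001,
    "crash_rate": 0.03,
    "revenue_per_user": 0.04,
    "page_load_time": 0.20,
    "error_rate": 0.50,
}
print(sorted(m for m, p in p_values.items() if p <= 0.05))  # uncorrected: three guardrails fire
print(holm_bonferroni(p_values))  # {'latency_p99'}
```

Holm controls the family-wise error rate while rejecting at least as much as plain Bonferroni; teams that tolerate some false alarms sometimes prefer a false-discovery-rate method instead.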

⚠️ Key Trade-off: Tightening thresholds to reduce false negatives increases false positives, and vice versa. The optimal point depends on relative costs of shipping harm vs blocking good features.
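One way to make "the optimal point depends on relative costs" concrete is a toy expected-cost model. It assumes the guardrail's observed relative degradation is normally distributed around the true effect with a known standard error, good experiments have zero true effect, harmful ones share a single fixed harm size, and the prior share of harmful experiments and both costs are known. Every parameter value and function name here is hypothetical.

```python
from statistics import NormalDist


def error_rates(threshold: float, harm_size: float, std_error: float) -> tuple[float, float]:
    """Return (false-positive rate, false-negative rate) for a guardrail that fires above threshold."""
    noise = NormalDist(0.0, std_error)
    # Good experiment: observed degradation ~ N(0, se); crossing the threshold is a false positive.
    false_positive = 1 - noise.cdf(threshold)
    # Harmful experiment: observed degradation ~ N(harm_size, se); staying below it is a miss.
    false_negative = noise.cdf(threshold - harm_size)
    return false_positive, false_negative


def expected_cost(threshold, harm_size, std_error, p_harmful, cost_miss, cost_block):
    """Expected cost per experiment: shipped harm plus wrongly blocked good features."""
    fp, fn = error_rates(threshold, harm_size, std_error)
    return p_harmful * cost_miss * fn + (1 - p_harmful) * cost_block * fp


def best_threshold(candidates, **model):
    """Pick the candidate threshold with the lowest expected cost under the model."""
    return min(candidates, key=lambda t: expected_cost(t, **model))


candidates = [i / 1000 for i in range(101)]  # 0.0% to 10.0% relative degradation
model = dict(harm_size=0.05, std_error=0.02, p_harmful=0.1)

# Shipping harm is 10x worse than blocking a good feature: the optimum threshold is tighter.
print(best_threshold(candidates, cost_miss=10, cost_block=1, **model))  # ≈ 0.024
# Blocking is 10x worse than shipping harm: the optimum threshold loosens.
print(best_threshold(candidates, cost_miss=1, cost_block=10, **model))  # ≈ 0.061
```

Under these assumptions, flipping the cost ratio moves the best threshold from roughly 2.4% to roughly 6.1%, which is the trade-off described above; reducing `std_error` (more traffic or a lower-variance metric) shrinks both error rates at once instead of trading one for the other.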

System Failures

Pipeline failures: logging gaps, aggregation bugs, comparison errors. Detection: run guardrails on A/A experiments, which should fire no more often than the configured false-positive rate (for example, about 5% of the time at α = 0.05). Monitoring: track the guardrail fire rate over time; sudden changes usually point to system issues, not treatment effects.
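A statistical guardrail that is working correctly still fires on A/A experiments at roughly its false-positive rate α, not zero, so both checks compare an observed fire rate with an expected one. The sketch assumes a two-sided z-test on means (normal approximation), simulated arms drawn from one hypothetical distribution (mean 100, standard deviation 20), and a 3-sigma binomial limit for alerting; all names and limits are illustrative.

```python
import random
from statistics import NormalDist, fmean, variance


def z_test_p_value(control: list[float], treatment: list[float]) -> float:
    """Two-sided p-value for a difference in means, using a normal approximation."""
    diff = fmean(treatment) - fmean(control)
    std_error = (variance(control) / len(control)
                 + variance(treatment) / len(treatment)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(diff) / std_error))


def aa_fire_rate(num_experiments: int, n_per_arm: int,
                 alpha: float = 0.05, seed: int = 7) -> float:
    """Simulate A/A experiments (both arms from the same distribution) and return the fire rate."""
    rng = random.Random(seed)
    fires = 0
    for _ in range(num_experiments):
        control = [rng.gauss(100, 20) for _ in range(n_per_arm)]
        treatment = [rng.gauss(100, 20) for _ in range(n_per_arm)]
        if z_test_p_value(control, treatment) < alpha:
            fires += 1
    return fires / num_experiments


def fire_rate_is_anomalous(fires: int, experiments: int,
                           expected_rate: float, z_limit: float = 3.0) -> bool:
    """Flag a fire rate too far from expected, in either direction (normal approx. to binomial)."""
    observed = fires / experiments
    std_error = (expected_rate * (1 - expected_rate) / experiments) ** 0.5
    return abs(observed - expected_rate) > z_limit * std_error


print(aa_fire_rate(num_experiments=1000, n_per_arm=100))  # close to alpha = 0.05
print(fire_rate_is_anomalous(20, 400, expected_rate=0.05))  # False: 5% matches expectation
print(fire_rate_is_anomalous(60, 400, expected_rate=0.05))  # True: a spike in fires
print(fire_rate_is_anomalous(0, 400, expected_rate=0.05))   # True: zero fires is suspicious too
```

For production monitoring, `expected_rate` would come from the historical fire rate rather than α, since real experiments include true effects. Alerting on a drop as well as a spike is a design choice: a fire rate that suddenly falls to zero can mean the pipeline stopped receiving data.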

💡 Key Takeaways
False negatives: threshold too loose, insensitive metric, long detection delay
False positives: threshold too tight, high variance, multiple testing, outliers
Optimal threshold depends on relative cost of shipping harm vs blocking good features
Validate the system with A/A experiments (fire rate should match the configured false-positive rate) and fire-rate monitoring
📌 Interview Tips
1. When explaining false negatives: describe harm shipping because the threshold was 10% when a 5% degradation occurred
2. For system validation: run guardrails on A/A experiments to verify they don't fire more often than their false-positive rate allows
Guardrail Failure Modes and Mitigation Strategies | Guardrail Metrics - System Overflow