Guardrail Failure Modes and Mitigation Strategies
Even well-designed guardrail systems fail in subtle ways that let harmful changes ship or create false alarms that erode trust. Understanding these failure modes and their mitigations is essential for operating experimentation at scale. The most common issues involve instrumentation gaps, metric misalignment, mix shift, nonstationarity, and multiple-testing noise.
Metric misalignment happens when the guardrail does not capture the actual harm. A classic example is latency guardrails. If you monitor p95 latency and set a threshold at 200ms, but payment failures spike when p99.9 latency exceeds 300ms, your guardrail will pass while revenue drops. Similarly, monitoring average error rate misses critical failures in specific user segments. At Uber, an overall rides-per-user guardrail may pass while rides per user among new users drops by 10 percent because of a poor onboarding change. The mitigation is segment-specific guardrails and tail-metric coverage. Define separate thresholds for new users, high-value cohorts, and critical paths like checkout or signup.
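As a concrete illustration, here is a minimal sketch of segment-specific guardrail evaluation with explicit tail coverage. The segments, metrics, thresholds, and the `evaluate_guardrails` helper are hypothetical, not any company's production API.

```python
import numpy as np

# Illustrative thresholds; real values depend on the product's tolerances.
# Key: (segment, metric, percentile or None for the mean) -> max allowed relative regression.
GUARDRAILS = {
    ("all_users", "latency_ms", 95):       0.05,  # p95 latency may grow at most 5%
    ("all_users", "latency_ms", 99.9):     0.05,  # tail coverage: p99.9 checked explicitly
    ("new_users", "rides_per_user", None): 0.02,  # new-user cohort gets its own threshold
    ("checkout", "error_rate", None):      0.10,  # critical path: at most 10% relative increase
}

def evaluate_guardrails(control: dict, treatment: dict) -> list:
    """Return the list of violated guardrails.

    `control` and `treatment` map (segment, metric) -> array of raw observations.
    """
    violations = []
    for (segment, metric, pct), max_regression in GUARDRAILS.items():
        c = control[(segment, metric)]
        t = treatment[(segment, metric)]
        c_val = np.percentile(c, pct) if pct is not None else np.mean(c)
        t_val = np.percentile(t, pct) if pct is not None else np.mean(t)
        # "Regression" is an increase for latency/error metrics and a decrease
        # for engagement metrics such as rides per user.
        higher_is_worse = metric in ("latency_ms", "error_rate")
        delta = (t_val - c_val) / c_val if higher_is_worse else (c_val - t_val) / c_val
        if delta > max_regression:
            violations.append((segment, metric, pct, round(float(delta), 4)))
    return violations
```

The important property is that every critical segment and every tail percentile gets its own explicit threshold, rather than relying on one global p95 check.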
Simpson's paradox and mix shift cause global guardrails to pass while subgroups regress. An ML ranking change improves aggregate Click Through Rate (CTR) by 1.5 percent because it shifts traffic toward high-CTR segments like mobile users. But CTR within each segment (mobile, desktop, tablet) actually drops by 0.5 percent. The aggregate improvement is purely compositional, and if the mix shift is temporary due to seasonality or experiment-induced selection bias, you ship a net-negative change. Stratified analysis and within-segment guardrails catch this: compute guardrails separately for each major dimension and require that no segment violates its threshold.
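A small worked example of the stratified check, using hypothetical per-segment CTRs and traffic shares chosen to mimic the pattern above: every segment regresses by 0.5 percent, yet the aggregate rises because traffic shifts toward the high-CTR mobile segment.

```python
# (ctr_control, ctr_treatment, traffic_share_control, traffic_share_treatment)
segments = {
    "mobile":  (0.0600, 0.059700, 0.50, 0.53),
    "desktop": (0.0300, 0.029850, 0.35, 0.33),
    "tablet":  (0.0250, 0.024875, 0.15, 0.14),
}

def aggregate_ctr(arm: str) -> float:
    idx_ctr, idx_share = (0, 2) if arm == "control" else (1, 3)
    return sum(v[idx_ctr] * v[idx_share] for v in segments.values())

agg_c, agg_t = aggregate_ctr("control"), aggregate_ctr("treatment")
print(f"aggregate CTR {agg_c:.4f} -> {agg_t:.4f} ({agg_t / agg_c - 1:+.1%})")  # about +1.6%

# Within-segment guardrail: flag any segment regressing by more than 0.3% relative.
for name, (c, t, *_rest) in segments.items():
    rel = t / c - 1
    status = "VIOLATION" if rel < -0.003 else "ok"
    print(f"{name:8s} CTR {c:.4f} -> {t:.4f} ({rel:+.2%}) {status}")
```

The aggregate check alone would pass this change; the within-segment loop flags all three platforms.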
Instrumentation gaps and logging bias are insidious. If the treatment variant has a 1 percent higher event drop rate than control due to a client SDK bug, synthetic lifts appear in all metrics. A feature that crashes on low-end Android devices creates survivorship bias: users who crash are missing from the denominator, making engagement metrics look better in treatment. At scale, even a 0.5 percent logging bias can create false 1 percent lifts in conversion metrics. The mitigation is continuous data quality monitoring. Track Sample Ratio Mismatch (SRM), where the observed traffic split deviates from the configured allocation by more than binomial confidence intervals allow. Run AA tests weekly to validate that identical variants show zero delta. Instrument crash and error events separately to detect logging loss.
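A common way to implement the SRM check is a chi-square goodness-of-fit test of the observed counts against the configured split. The counts and the strict alpha below are illustrative assumptions; SRM alarms are usually held to a much smaller alpha than metric tests because they indicate a broken experiment rather than a marginal effect.

```python
from scipy import stats

def passes_srm_check(n_control: int, n_treatment: int,
                     expected_treatment_share: float = 0.5,
                     alpha: float = 1e-3) -> bool:
    """True if the observed split is consistent with the configured allocation.

    A tiny p-value means the assignment or logging pipeline is likely broken,
    so metric deltas from this experiment should not be trusted.
    """
    total = n_control + n_treatment
    expected = [total * (1 - expected_treatment_share), total * expected_treatment_share]
    _chi2, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p_value >= alpha

# Hypothetical counts echoing the Uber example below: 49.4% vs 50.6% of one million users.
print(passes_srm_check(n_control=494_000, n_treatment=506_000))  # False -> SRM detected
```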
💡 Key Takeaways
•Multiple testing creates an approximately 92 percent chance of at least one false positive when checking 50 metrics at alpha 0.05 in an AA test (see the sketch after this list). Airbnb mitigates this by restricting statistically significant negative checks to 3 to 5 top metrics and using stricter thresholds or Bonferroni correction for exploratory metrics.
•Nonstationarity and novelty effects cause short-term guardrails to pass while long-term harms accumulate. A recommender increases 7-day watch time by 3 percent but reduces 30-day retention by 1.2 percent due to user fatigue. Teams at Netflix run long-term holdouts, keeping 1 to 2 percent of users in control for 60 to 90 days to measure delayed effects.
•Cannibalization across product surfaces is invisible without ecosystem guardrails. At Meta, a new creation tool in Stories increases Story posts by 15 percent but reduces News Feed posts by 8 percent, so overall engagement drops by 2 percent. An ecosystem guardrail on Time Spent on Facebook flags this before full launch.
•Cost leakage in ML systems often bypasses quality guardrails. An LLM feature at Google passes quality metrics but increases token usage from 200 to 600 tokens per request. At 1000 Queries Per Second (QPS), cost jumps from 5 thousand dollars per day to 15 thousand dollars per day. Adding cost per request as a Tier 1 guardrail catches unsustainable scaling.
•Coverage blind spots allow segment harms at low traffic. An experiment at 2 percent coverage harms a 5 percent user segment by 10 percent, but only 0.1 percent of all users are exposed to the harm (2 percent coverage times the 5 percent segment), so the diluted company-level impact stays below the global threshold. A segment guardrail with absolute thresholds for safety metrics like crash rate or data loss prevents this.
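For reference, the arithmetic behind the 92 percent figure in the first takeaway, together with the Bonferroni-corrected per-metric alpha; this assumes the 50 metrics are independent, which real metrics rarely are, so the true family-wise rate is somewhat lower.

```python
# Family-wise false-positive probability for m independent metrics tested at level alpha.
m, alpha = 50, 0.05
fwer = 1 - (1 - alpha) ** m
print(f"P(at least one false alarm in an AA test) = {fwer:.2f}")  # ~0.92

# Bonferroni correction: test each exploratory metric at alpha / m instead.
print(f"Bonferroni per-metric alpha = {alpha / m:.4f}")  # 0.0010
```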
📌 Examples
Netflix ranking model experiment shows aggregate streaming hours per user increasing by 1.2 percent. Segment analysis reveals: mobile streaming hours per user drop by 0.8 percent, desktop up by 2.5 percent, TV up by 1.1 percent. The aggregate lift is driven by treatment shifting mix toward desktop users, who have higher baseline streaming. Within-platform guardrails catch the mobile regression. Team investigates and finds the new model underweights mobile-specific signals. Model is retrained before launch.
Uber driver app experiment shows driver earnings per hour improving by 2.1 percent in treatment. Instrumentation review reveals the treatment variant has a 1.2 percent higher event drop rate on Android 8 and below due to a logging library incompatibility. Correcting for survivorship bias, the true earnings lift is only 0.8 percent. Sample Ratio Mismatch detection flagged this: the observed traffic split was 50.6 percent treatment versus 49.4 percent control, deviating from the configured 50/50 by more than 3 standard deviations.
Google Ads ranking experiment improves Click Through Rate (CTR) by 2.3 percent and passes all guardrails, including revenue per user (up 1.8 percent). A long-term holdout after 60 days shows 28-day retention drops by 0.9 percentage points and advertiser Return on Ad Spend (ROAS) decreases by 4 percent as users click more low-quality ads. Short-term guardrails missed the delayed harm. Team reverts the launch and redesigns the ranking objective to include long-term user value signals.