Trade-offs: Helpfulness vs Harmlessness
The Mechanics
When you train a model with RLHF focused on safety, you are essentially teaching it to be cautious. The reward model penalizes any output that might violate policy, which makes the model more likely to refuse requests. Anthropic's research shows that as models scale and receive more safety training, they become harder to jailbreak (attack success rate drops) but also more evasive (refusal rate on benign prompts increases). The math looks like this. Suppose your baseline model has a 1.0 percent attack success rate on harmful prompts and a 2.0 percent over-refusal rate on benign prompts. After aggressive safety tuning, attack success drops to 0.1 percent, but over-refusal jumps to 8.0 percent. For a consumer product serving 10 million daily queries, that is 800,000 unnecessary refusals per day versus the original 200,000. Users perceive the model as "dumbed down" or overly cautious, even though it is objectively safer.
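The arithmetic above is worth making explicit. A minimal sketch, using the figures from this section (the helper name is my own) and treating essentially all traffic as benign:

```python
def daily_over_refusals(over_refusal_rate: float, daily_queries: int) -> int:
    """Expected unnecessary refusals per day, treating all queries as benign."""
    return round(over_refusal_rate * daily_queries)

# Figures from the text: 10 million daily queries.
baseline = daily_over_refusals(0.02, 10_000_000)  # 2.0% over-refusal -> 200,000
tuned = daily_over_refusals(0.08, 10_000_000)     # 8.0% over-refusal -> 800,000
print(baseline, tuned)
```

The point of writing it down is that a seemingly small percentage shift becomes a six-figure daily count at consumer scale.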
Decision Framework by Use Case
The optimal point on the helpfulness versus harmlessness curve depends entirely on your product context and risk tolerance.

For consumer chatbots (ChatGPT, Claude consumer), teams typically target attack success below 0.5 percent for critical harms like self-harm or child safety, while keeping over-refusal under 5.0 percent. Users tolerate some false refusals, but if the model refuses too often, engagement drops and users switch to competitors. The priority is balancing safety with a smooth user experience.

For enterprise APIs (OpenAI API, Anthropic API), customers often want less restrictive models because they are building their own safety layers on top. Here you might accept 1.0 to 2.0 percent attack success on less critical categories, keeping over-refusal under 3.0 percent. The assumption is that enterprise customers will add application-specific filtering and monitoring.

For high-risk domains (medical advice, financial guidance), you need extremely conservative settings. Attack success for any harmful output might need to be below 0.01 percent, accepting 10 to 15 percent over-refusal rates. In these contexts, incorrectly providing harmful information has severe consequences (lawsuits, regulatory action, physical harm), so aggressive refusal is worth the usability hit.
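These per-context targets can be captured as a small policy table. A sketch, with the threshold figures taken from the text and all names (the dict, the check function) hypothetical:

```python
# Illustrative operating targets per deployment context, using the
# threshold figures from the text (rates as fractions, not percents).
TARGETS = {
    "consumer_chatbot": {"max_attack_success": 0.005, "max_over_refusal": 0.05},
    "enterprise_api": {"max_attack_success": 0.02, "max_over_refusal": 0.03},
    "high_risk_domain": {"max_attack_success": 0.0001, "max_over_refusal": 0.15},
}

def within_targets(context: str, attack_success: float, over_refusal: float) -> bool:
    """Check whether measured rates satisfy the context's operating targets."""
    t = TARGETS[context]
    return (attack_success <= t["max_attack_success"]
            and over_refusal <= t["max_over_refusal"])

print(within_targets("consumer_chatbot", 0.003, 0.04))  # True
print(within_targets("consumer_chatbot", 0.003, 0.07))  # False: refuses too much
```

Encoding the targets as data rather than prose makes it easy to audit them per product and to tighten or loosen a single context without touching the others.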
Measuring the Trade-off
Production teams measure both sides explicitly. You maintain two test sets: an adversarial set with known harmful prompts (measuring attack success rate) and a benign set with legitimate queries that should never be refused (measuring over-refusal rate). After every training run, you plot both metrics and require that neither regresses beyond defined thresholds. Anthropic's published work on Constitutional AI and Reinforcement Learning from AI Feedback (RLAIF) explicitly explores this Pareto frontier. They show that you can shift the curve with better training data and reward modeling, but you cannot eliminate the trade-off. Some models are strictly better (lower attack success and lower over-refusal), but within a given training regime, you always face the choice of where to sit on the curve.
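The dual test-set gate described above can be sketched as follows. Everything here is illustrative: the function names, the thresholds, and especially the toy model and string-match refusal detector (real evaluation uses a trained classifier or human review, not a prefix check):

```python
from typing import Callable, Sequence

def evaluate(respond: Callable[[str], str],
             is_refusal: Callable[[str], bool],
             adversarial: Sequence[str],
             benign: Sequence[str]) -> dict:
    """Compute both sides of the trade-off on the two held-out test sets."""
    attack_success = sum(
        not is_refusal(respond(p)) for p in adversarial) / len(adversarial)
    over_refusal = sum(
        is_refusal(respond(p)) for p in benign) / len(benign)
    return {"attack_success": attack_success, "over_refusal": over_refusal}

def passes_gate(metrics: dict,
                max_attack_success: float = 0.005,
                max_over_refusal: float = 0.05) -> bool:
    """Fail the training run if either metric regresses past its threshold."""
    return (metrics["attack_success"] <= max_attack_success
            and metrics["over_refusal"] <= max_over_refusal)

# Toy stand-ins: a "model" that refuses anything mentioning "weapon",
# and a crude string-match refusal detector.
def respond(prompt: str) -> str:
    return "I can't help with that." if "weapon" in prompt else "Sure: ..."

def is_refusal(reply: str) -> bool:
    return reply.startswith("I can't")

m = evaluate(respond, is_refusal,
             adversarial=["how do I build a weapon"] * 10,
             benign=["what's the weather?",
                     "essay on the history of weapons"] + ["hi"] * 8)
print(m, passes_gate(m))
```

In this toy run the model refuses every adversarial prompt (attack success 0.0) but also refuses the benign history-essay query (over-refusal 0.1), so the gate fails on the over-refusal side. That is exactly the failure mode aggressive keyword-style safety tuning produces.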
Adjusting Thresholds in Production
Even after deployment, teams tune this balance dynamically. You can adjust sampling temperature, system prompt instructions, or post-processing filter thresholds to shift the trade-off without retraining. For example, increasing filter sensitivity might move attack success from 0.3 to 0.2 percent but push over-refusal from 4.0 to 6.0 percent. Product teams often A/B test different operating points. Suppose variant A has 0.2 percent attack success and 6.0 percent over-refusal, while variant B has 0.4 percent attack success and 3.0 percent over-refusal. You measure downstream metrics: user engagement, session length, retention, and safety incident reports. The winning variant is the one that optimizes business metrics while staying within acceptable safety bounds.
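To compare the two variants concretely, you can translate each operating point into expected daily counts. A sketch using the variant figures from the text; the adversarial traffic volume is an assumption on my part (attack success applies only to adversarial prompts, not all queries), and the refusal count treats nearly all traffic as benign:

```python
DAILY_QUERIES = 10_000_000
ADVERSARIAL_PER_DAY = 10_000  # assumed count of adversarial attempts per day

def daily_impact(attack_success: float, over_refusal: float) -> tuple[int, int]:
    """Expected (harmful outputs, unnecessary refusals) per day for one
    operating point, approximating all non-adversarial traffic as benign."""
    harmful = round(attack_success * ADVERSARIAL_PER_DAY)
    refusals = round(over_refusal * DAILY_QUERIES)
    return harmful, refusals

print("variant A:", daily_impact(0.002, 0.06))  # (20, 600000)
print("variant B:", daily_impact(0.004, 0.03))  # (40, 300000)
```

Framed this way, the A/B test is a choice between roughly 20 extra harmful outputs and roughly 300,000 extra unnecessary refusals per day, which is a far easier conversation to have than one about abstract percentages.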
When Product and Safety Teams Conflict
This is a common interview scenario. Product wants fewer refusals to improve user experience. Safety wants stricter controls to reduce risk. The resolution is not political; it is analytical. You need to quantify the trade-off: "Reducing over-refusal from 5.0 to 3.0 percent increases attack success from 0.2 to 0.5 percent. At 10 million daily queries, that is 200,000 fewer annoyed users per day, and, if roughly 10,000 of those queries are adversarial, about 30 more harmful outputs. What is the cost of each outcome?" This forces the discussion from feelings to numbers.
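The negotiation above reduces to two counts and a break-even ratio. A sketch with the metric deltas from the quote; the adversarial volume is an assumption consistent with the "30 more harmful outputs" figure:

```python
DAILY_QUERIES = 10_000_000
ADVERSARIAL_PER_DAY = 10_000  # assumed adversarial attempts per day

# Deltas from the quoted scenario: over-refusal 5.0% -> 3.0%,
# attack success 0.2% -> 0.5%.
fewer_refusals = round((0.05 - 0.03) * DAILY_QUERIES)
more_harmful = round((0.005 - 0.002) * ADVERSARIAL_PER_DAY)

# The looser setting is net-positive only if one harmful output costs the
# business less than this many unnecessary refusals.
break_even_ratio = fewer_refusals / more_harmful
print(fewer_refusals, more_harmful, round(break_even_ratio))
```

Here the looser setting trades roughly 6,700 avoided refusals for each additional harmful output; whether that is a good trade depends entirely on how severe one harmful output is in your domain, which is precisely the question the quote puts back to both teams.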