Underrefusal

A failure mode where an LLM complies with requests that violate policy or safety guidelines.

What is Underrefusal?

Underrefusal is an LLM failure mode where the model complies with requests that should have been refused because they violate policy or safety guidelines. In practice, it means the model is too willing to answer when it should draw a firm boundary.

Understanding Underrefusal

Underrefusal sits on the opposite side of the safety spectrum from overrefusal. A model that underrefuses may still sound helpful and fluent, but it fails to recognize disallowed intent, unsafe instructions, or policy-bound requests that require a refusal or a safer redirect. OpenAI’s public safety materials describe refusal evaluations specifically for disallowed content and jailbreaks, which is the kind of testing used to catch this behavior. (openai.com)

For builders, underrefusal is usually not a single bug, but a calibration problem across prompts, policies, classifiers, and fine-tuning. If the model is optimized too heavily for helpfulness, or if the safety boundary is underspecified, it may answer questions it should not. That is why modern safety stacks test both refusal behavior and safe completion behavior, so teams can measure whether the model refuses when it should and stays useful when it should not refuse. (openai.com)

Key aspects of underrefusal include:

Unsafe compliance: The model answers prompts that should trigger a refusal.
Boundary miss: The model fails to detect that the request falls inside a restricted policy area.
Calibration tradeoff: Tightening safety can reduce underrefusal, but teams must watch for overrefusal too.
Evaluation driven: It is usually found through targeted refusal tests, red-teaming, and jailbreak benchmarks.
System-level issue: It often reflects the full stack, not just the base model.

Advantages of Underrefusal

There are no real product advantages to underrefusal itself, but understanding the failure mode helps teams improve safety and reliability.

Clearer safety specs: Teams can define which requests must be refused with more precision.
Better evaluation coverage: It motivates testing against harmful prompts, jailbreaks, and edge cases.
Stronger guardrails: It encourages adding policy checks, classifiers, and safer completions.
Improved trust: Reducing unsafe compliance helps users trust the system on sensitive requests.
More balanced behavior: It pushes teams toward a better tradeoff between helpfulness and safety.

Challenges in Underrefusal

Underrefusal is hard to eliminate because the safety boundary is contextual and can be subtle.

Ambiguous intent: Some prompts look harmless until the broader context is considered.
Prompt injection risk: Untrusted instructions can push the model into unsafe compliance.
Benchmark gaps: A model may pass a static test set and still fail in production.
Tuning tradeoffs: Safety tuning that is too aggressive can create the opposite problem, overrefusal.
Policy drift: Safety rules and acceptable-use standards change over time, so evaluation must stay current.

Example of Underrefusal in Action

Scenario: A user asks an assistant for step-by-step instructions that would help bypass a safety policy or generate harmful instructions.

If the model gives a direct answer instead of refusing and redirecting, that is underrefusal. A safer response would acknowledge the request boundary, decline the harmful part, and offer a benign alternative such as general safety guidance, policy-compliant troubleshooting, or high-level educational context.

In a production workflow, PromptLayer can help teams log these exchanges, tag unsafe-compliance cases, and compare prompt or model variants side by side. That makes it easier to spot where refusal behavior breaks down and to iterate on prompts, routing, or evaluation rules.

How PromptLayer Helps with Underrefusal

PromptLayer gives teams a place to trace model outputs, review refusal behavior, and run structured evaluations on prompts that touch safety boundaries. By combining logging, versioning, and feedback loops, the PromptLayer team helps you see where an assistant is complying when it should not, then tighten the workflow around those cases.

Ready to try it yourself? Sign up for PromptLayer and start managing your prompts in minutes.