While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
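The localization step can be illustrated with a minimal sketch of classic delta debugging (the ddmin procedure) applied to a prompt. The abstract does not specify how DDOR tokenizes prompts or queries the model, so everything concrete here is an assumption made for illustration: the prompt is split on whitespace, and `toy_refuses` is a hypothetical keyword-based stand-in for the black-box refusal check, which in practice would send the candidate prompt to the model and classify its response.

```python
import math
from typing import Callable, List, Sequence


def ddmin(tokens: Sequence[str], refuses: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `tokens` to a 1-minimal sublist that still triggers a refusal.

    `refuses` is a hypothetical black-box oracle: it returns True when the
    model refuses the prompt formed from the given tokens.
    """
    current = list(tokens)
    if not refuses(current):
        raise ValueError("the full prompt must trigger a refusal")

    n = 2  # granularity: number of chunks the prompt is split into
    while len(current) >= 2:
        size = math.ceil(len(current) / n)
        chunks = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False

        # Step 1: does a single chunk trigger the refusal on its own?
        for chunk in chunks:
            if refuses(chunk):
                current, n, reduced = chunk, 2, True
                break

        # Step 2: otherwise, can one chunk be dropped while the rest still triggers?
        if not reduced:
            for i in range(len(chunks)):
                rest = [t for j, c in enumerate(chunks) if j != i for t in c]
                if refuses(rest):
                    current, n, reduced = rest, max(n - 1, 2), True
                    break

        # Step 3: no progress, so refine the granularity or stop.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)

    return current


def toy_refuses(tokens: List[str]) -> bool:
    """Toy oracle (assumption): refuse whenever 'kill' and 'child' co-occur."""
    return "kill" in tokens and "child" in tokens


if __name__ == "__main__":
    prompt = "how to kill a child process"
    print(ddmin(prompt.split(), toy_refuses))
```

With this toy oracle, the benign systems question "how to kill a child process" reduces to the fragment `['kill', 'child']`, the kind of phrase-level evidence that an mRTF is meant to provide. Against a real model, each oracle call is a model query, so caching responses for repeated candidates would likely be worthwhile.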