Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context

Safety alignment in large language models (LLMs) is critical for healthcare; however, reliance on binary refusal boundaries often results in \emph{over-refusal} of benign queries or \emph{unsafe compliance} with harmful ones. While existing benchmarks measure these extremes, they fail to evaluate Safe Completion: the model's ability to maximise helpfulness on dual-use or borderline queries by providing safe, high-level guidance without crossing into actionable harm. We introduce \textbf{Health-ORSC-Bench}, the first large-scale benchmark designed to systematically measure \textbf{Over-Refusal} and \textbf{Safe Completion} quality in healthcare. It comprises 31,920 benign boundary prompts across seven health categories (e.g., self-harm, medical misinformation) and uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. We evaluate 30 state-of-the-art LLMs, including GPT-5 and Claude-4, and find a significant tension: safety-optimised models frequently refuse up to 80\% of "Hard" benign prompts, while domain-specific models often sacrifice safety for utility. Our findings demonstrate that model family and size significantly influence calibration: larger frontier models (e.g., GPT-5, Llama-4) exhibit "safety-pessimism" and higher over-refusal rates than smaller or MoE-based counterparts (e.g., Qwen-3-Next), which shows that current LLMs struggle to balance refusal and compliance. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants toward nuanced, safe, and helpful completions. The code and data will be released upon acceptance. \textcolor{red}{Warning: Some content may be toxic or otherwise undesirable.}
