Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, enhanced safety often comes with the side effect of over-refusal, where LLMs reject innocuous prompts and become less helpful. Although over-refusal has been empirically observed, measuring it systematically is challenging because it is difficult to craft prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely to be rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/OR-Bench and the corresponding demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety-aligned models.