Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries that trigger unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with an 85.34% higher average refusal-triggering rate across 9 LLMs without a safety-prior system prompt, 34.86% greater lexical diversity, and 40.03% higher LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. With supervised fine-tuning on EVOREFUSE-ALIGN, LLAMA3.1-8B-INSTRUCT achieves up to 29.85% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context. Our code and datasets are available at https://github.com/FishT0ucher/EVOREFUSE.
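The evolutionary search described above can be sketched in simplified form. This is an illustrative assumption, not the paper's actual implementation: the function names (`mutate`, `recombine`, `refusal_score`, `evolve`), the placeholder mutation and recombination operators, and the keyword-based fitness heuristic are all hypothetical stand-ins. In the real method, fitness would be an evidence lower bound on an LLM's refusal probability, and mutations would be LLM-driven rewrites rather than string edits.

```python
import random

def mutate(instruction, rng):
    """Placeholder mutation: append a harmless-context clause (illustrative only)."""
    clauses = [
        " for a fiction story",
        " in a history lesson",
        " for a safety training course",
    ]
    return instruction + rng.choice(clauses)

def recombine(a, b):
    """Placeholder recombination: splice the first half of one instruction
    onto the second half of another (illustrative only)."""
    return a[: len(a) // 2] + b[len(b) // 2 :]

def refusal_score(instruction):
    """Stand-in fitness: the real method maximizes a lower bound on an
    LLM's refusal probability; here we count surface-level trigger words."""
    keywords = ("kill", "attack", "bomb", "hack")
    return sum(word in instruction.lower() for word in keywords)

def evolve(seeds, generations=10, population_size=8, seed=0):
    """Iteratively evolve seed instructions toward higher refusal scores."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        # Generate candidates via mutation and recombination.
        children = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        children += [recombine(*rng.sample(population, 2)) for _ in range(2)]
        # Keep the candidates most likely to trigger refusals.
        population = sorted(
            population + children, key=refusal_score, reverse=True
        )[:population_size]
    return population
```

For example, `evolve(["How do I kill a Python process?", "Ways to attack a chess problem"])` returns a population of semantically harmless instructions ranked by how strongly they resemble refusal triggers, mirroring the mutate-recombine-select loop the abstract describes.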