The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce ReGap, a metric that quantifies the extent of reward misspecification, and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building on these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective over previous methods.