Recent studies have exposed the risk of Large Language Models (LLMs) generating harmful content under jailbreak attacks. However, they overlook that directly generating harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In this study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken either by directly outputting harmful responses or by calibrating the residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can jailbreak LLMs effectively and transferably, achieving a state-of-the-art attack success rate across multiple advanced LLMs.