This paper proposes a simple yet effective jailbreak attack against black-box LLMs, named FlipAttack. First, drawing on the autoregressive nature of LLMs, we reveal that they tend to understand text from left to right and struggle to comprehend it when noise is added to the left side. Motivated by these insights, we propose disguising a harmful prompt by constructing left-side noise based solely on the prompt itself, and generalize this idea into 4 flipping modes. Second, we verify that LLMs are strong at performing the text-flipping task, and accordingly develop 4 variants that guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within a single query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Notably, it achieves a $\sim$98\% attack success rate on GPT-4o and an average $\sim$98\% bypass rate against 5 guardrail models. The code is available on GitHub\footnote{https://github.com/yueliu1999/FlipAttack}.
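To make the idea of constructing left-side noise from the prompt itself concrete, here is a minimal sketch of three plausible flipping strategies, implemented as plain string reversals. The function names and the benign example prompt are illustrative assumptions; the paper's exact mode definitions are in the repository linked above.

```python
def flip_word_order(text: str) -> str:
    # Reverse the order of words, keeping each word intact.
    return " ".join(reversed(text.split()))

def flip_chars_in_word(text: str) -> str:
    # Reverse the characters inside each word, keeping word order.
    return " ".join(w[::-1] for w in text.split())

def flip_chars_in_sentence(text: str) -> str:
    # Reverse the entire character sequence of the sentence.
    return text[::-1]

# Illustrative, benign prompt (assumption, not from the paper):
prompt = "write a tutorial"
print(flip_word_order(prompt))         # -> "tutorial a write"
print(flip_chars_in_word(prompt))      # -> "etirw a lairotut"
print(flip_chars_in_sentence(prompt))  # -> "lairotut a etirw"
```

Because each transformation is a bijection on the prompt's own characters, the original text is exactly recoverable by applying the same flip again, which is what lets the target model "denoise" the disguised prompt without any external key.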