Large Language Models(LLMs) have been successful in numerous fields. Alignment has usually been applied to prevent them from harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, using carefully crafted, unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses. Our code is available at https://github.com/liuup/ShallowJail.
翻译:大型语言模型(LLMs)已在众多领域取得成功。通常通过对齐技术来防止其被用于有害目的。然而,经过对齐的LLMs仍易受越狱攻击的影响,这些攻击会故意误导模型产生有害输出。现有的越狱方法要么是基于黑盒的、使用精心设计但隐蔽性差的提示,要么是基于白盒的、需要大量计算资源。针对这些挑战,我们提出了ShallowJail——一种利用LLMs浅层对齐特性的新型攻击方法。ShallowJail通过在推理过程中操纵初始令牌来误导LLMs的响应。通过大量实验,我们证明了ShallowJail的有效性,它能显著降低当前最先进LLMs响应的安全性。我们的代码发布于https://github.com/liuup/ShallowJail。