Large Language Models (LLMs) are susceptible to jailbreaking attacks, which aim to extract harmful information by subtly modifying the attack query. As defense mechanisms evolve, obtaining harmful information directly becomes increasingly difficult for such attacks. In this work, inspired by the human practice of using indirect context to elicit harmful information, we focus on a new form of attack called Contextual Interaction Attack. The idea relies on the autoregressive nature of the generation process in LLMs. We contend that the prior context, that is, the information preceding the attack query, plays a pivotal role in enabling potent jailbreaking attacks. Specifically, we propose an approach that uses preliminary question-answer pairs to interact with the LLM, thereby guiding the model's responses toward revealing the 'desired' harmful information. We conduct experiments on four different LLMs and demonstrate the efficacy of this attack, which is black-box and can also transfer across LLMs. We believe this can lead to further developments in, and a better understanding of, the context vector in LLMs.