Large language models (LLMs) are initially trained on vast amounts of data and then fine-tuned using reinforcement learning from human feedback (RLHF); this fine-tuning also teaches the LLM to provide appropriate and safe responses. In this paper, we present a novel method to manipulate the fine-tuned model into reverting to its pre-RLHF behavior, effectively erasing the model's filters; the exploit currently works for GPT-4, Claude Sonnet, and, to some extent, Inflection-2.5. Unlike other jailbreaks (for example, the popular "Do Anything Now" (DAN)), our method does not rely on instructing the LLM to override its RLHF policy; hence, simply modifying the RLHF process is unlikely to address it. Instead, we induce a hallucination involving reversed text, during which the model reverts to a word bucket, effectively pausing the model's filters. We believe that our exploit exposes a fundamental, currently unaddressed vulnerability in LLMs and offers an opportunity to better understand the inner workings of LLMs during hallucinations.
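To make the reversed-text mechanism concrete, the following is a minimal sketch of how a prompt containing reversed text might be constructed. The helper name `reverse_text`, the wrapper prompt wording, and the sample payload are illustrative assumptions; the abstract does not specify the actual prompt used in the exploit.

```python
# Illustrative sketch only: the paper's actual exploit prompt is not
# reproduced here. This shows plain character-level string reversal,
# the kind of transformation the abstract refers to as "reversed text".

def reverse_text(text: str) -> str:
    """Reverse a string character by character (e.g. "hello" -> "olleh")."""
    return text[::-1]

# Hypothetical example: embed reversed text inside an otherwise ordinary prompt.
payload = reverse_text("The quick brown fox jumps over the lazy dog.")
prompt = f"Please read the following reversed text carefully: {payload}"
print(prompt)
```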