Warning: This article includes red-teaming experiments, which contain examples of compromised LLM responses that may be offensive or upsetting. Large Language Models (LLMs) have the potential to create harmful content, such as sophisticated phishing emails or code for computer viruses. It is therefore crucial to ensure that they generate responses safely and responsibly. To reduce the risk of harmful or irresponsible output, researchers have developed techniques such as reinforcement learning from human feedback (RLHF) to align LLM outputs with human values and preferences. However, it remains undetermined whether such measures are sufficient to prevent LLMs from generating harmful responses. In this study, we propose Amnesia, a lightweight activation-space adversarial attack that manipulates internal transformer states to bypass existing safety mechanisms in open-weight LLMs. Through experiments on state-of-the-art open-weight LLMs, we demonstrate that our attack effectively circumvents existing safeguards, enabling the generation of harmful content without any fine-tuning or additional training. Our experiments on benchmark datasets show that the proposed attack can induce a range of antisocial behaviors in LLMs. These findings highlight the urgent need for more robust security measures in open-weight LLMs and underscore the importance of continued research to prevent their potential misuse.