To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained for alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these methods often demand substantial computational resources, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack (TA^2), which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. Our experimental results on four primary alignment tasks show that TA^2 is highly effective and incurs little to no overhead in attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks.
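To make the attack surface concrete, the following is a minimal sketch (not the authors' implementation) of how an inference-time activation intervention can be wired into a decoder-only LLM using a forward hook. It assumes a HuggingFace GPT-2-style model purely for illustration; the layer index, intervention strength, and steering vector below are hypothetical placeholders, whereas in TA^2 the vector would be derived from the model's own activations rather than sampled at random.

```python
# Minimal sketch of inference-time activation steering via a forward hook.
# Assumptions: a GPT-2-style decoder from HuggingFace transformers; the
# target layer, strength, and steering vector are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an aligned instruction-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

target_layer = 6                      # hypothetical intervention layer
hidden_size = model.config.hidden_size

# In an actual attack this vector would be crafted (e.g., by contrasting
# activations of desired vs. undesired behavior); here it is random.
steering_vector = torch.randn(hidden_size)
steering_vector = steering_vector / steering_vector.norm()
strength = 4.0                        # hypothetical intervention strength

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the steering vector to every token position and pass the rest through.
    hidden = output[0] + strength * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[target_layer].register_forward_hook(steer)

prompt = "Explain how the system works."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # removing the hook restores the original model behavior
```

Because the hook is attached and removed at inference time, the model's weights and training data remain untouched, which is what distinguishes this style of attack from data poisoning or prompt injection.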