Recently, large language models (LLMs) with powerful general capabilities have been increasingly integrated into various Web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. Unfortunately, in practical applications they still risk generating harmful content, such as hate speech and content about criminal activities. Current approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. However, they typically focus on "superficial" harmful prompts with a single intent, ignoring composite attack instructions with multiple intentions, which can easily elicit harmful content in real-world scenarios. In this paper, we introduce an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which attack by combining and encapsulating multiple instructions. CIA hides harmful prompts within instructions that have harmless intentions, making it difficult for the model to identify the underlying malicious intent. Furthermore, we implement two transformation methods, T-CIA and W-CIA, which automatically disguise harmful instructions as talking or writing tasks so that they appear harmless to LLMs. We evaluate CIA on GPT-4, ChatGPT, and ChatGLM2 with two safety assessment datasets and two harmful prompt datasets. It achieves an attack success rate of 95%+ on the safety assessment datasets and, on the harmful prompt datasets, 83%+ for GPT-4 and 91%+ for both ChatGPT (gpt-3.5-turbo backed) and ChatGLM2-6B. Our approach reveals the vulnerability of LLMs to compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to LLM security development. Warning: this paper may contain offensive or upsetting content!