Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
翻译:当今的大语言模型容易受到提示注入、越狱攻击及其他恶意手段的影响,使得攻击者能用自身恶意提示覆盖模型的原始指令。本研究认为,这类攻击的主要脆弱性根源之一在于,大语言模型通常将系统提示(例如应用开发者提供的文本)与不可信用户及第三方文本视为同等优先级。为解决此问题,我们提出一种指令层次结构,明确定义了当不同优先级的指令发生冲突时模型的行为准则。继而提出一种数据生成方法,通过演示分层指令遵循行为,训练大语言模型选择性忽略低权限指令。我们将该方法应用于GPT-3.5,结果表明其在显著增强鲁棒性的同时(即使对训练中未见过的攻击类型也有效),对标准能力的损害极小。