Despite their remarkable performance, large language models lack elementary safety features, which makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as the root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of token embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instruction and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) achieves substantially higher instruction-data separation without loss of model capability and (2) improves robustness on prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at https://github.com/egozverev/aside.
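The core operation described above can be illustrated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes the rotation is a fixed isoclinic 90-degree rotation over consecutive pairs of embedding dimensions (one concrete choice of parameter-free orthogonal matrix; the embedding dimension is assumed even), and that each token carries a boolean flag marking it as data rather than instruction. The function names `isoclinic_rotation` and `embed_with_aside` are illustrative, not from the paper's codebase.

```python
import numpy as np

def isoclinic_rotation(dim: int) -> np.ndarray:
    # Fixed orthogonal matrix that rotates each consecutive pair of
    # embedding dimensions by 90 degrees. No learnable parameters;
    # assumes `dim` is even (illustrative choice of rotation).
    R = np.zeros((dim, dim))
    for i in range(0, dim, 2):
        R[i, i + 1] = -1.0
        R[i + 1, i] = 1.0
    return R

def embed_with_aside(token_embeddings: np.ndarray,
                     is_data: np.ndarray) -> np.ndarray:
    # token_embeddings: (seq_len, dim) array of embedding vectors.
    # is_data: (seq_len,) boolean mask, True for data tokens.
    # Instruction tokens keep their original embeddings; data-token
    # embeddings are rotated, yielding geometrically distinct
    # representations while preserving norms and inner products.
    R = isoclinic_rotation(token_embeddings.shape[1])
    out = token_embeddings.copy()
    out[is_data] = out[is_data] @ R.T
    return out
```

Because the rotation is orthogonal, it preserves vector norms and pairwise angles within the data segment, so the data embeddings remain as expressive as before while occupying a subspace the model can learn to treat differently from instructions.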