Instruction-tuned Large Language Models (LLMs) have achieved breakthrough results, opening up new possibilities for many practical applications. However, LLMs lack elementary safety features that are established norms in other areas of computer science, such as the separation between instructions and data. This causes them to malfunction or renders them vulnerable to manipulation and interference by third parties, e.g., via indirect prompt/command injection. Worse, there is so far no established definition of what precisely such a separation would mean or how its violation could be tested. In this work, we aim to close this gap. We introduce a formal measure to quantify the phenomenon of instruction-data separation, as well as an empirical variant of the measure that can be computed from a model's black-box outputs. We also introduce a new dataset, SEP (Should it be Executed or Processed?), which allows estimating the measure, and we report results on several state-of-the-art open-source and closed LLMs. Finally, we quantitatively demonstrate that none of the evaluated LLMs achieves a high degree of separation according to our measure. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.