Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.
翻译:大型语言模型(LLMs)虽然功能强大,但构建和训练使其仅能处理单一文本输入。在常见应用中,通过将多个输入拼接成单一文本流进行处理。然而,LLM无法区分提示中哪些部分属于不同输入来源。间接提示注入攻击利用这一漏洞,将对抗性指令嵌入到与用户命令一同处理的不可信数据中。通常,LLM会误将对抗性指令视为需执行的用户命令,从而在整体系统中产生安全漏洞。我们提出"聚光灯技术"(spotlighting),这是一类用于提升LLM区分多输入来源能力的提示工程方法。其核心思路是利用输入变换来提供可靠且持续的来源可溯信号。我们评估了聚光灯技术作为间接提示注入攻击防御手段的效果,发现该技术是一种鲁棒性防御,且对底层NLP任务的负面影响极小。基于GPT系列模型的实验表明,聚光灯技术在任务效能几乎不受影响的情况下,将攻击成功率从高于50%降低至低于2%。