The integration of large language models (LLMs) with external content has enabled more up-to-date and wide-ranging applications of LLMs, such as Microsoft Copilot. However, this integration has also exposed LLMs to the risk of indirect prompt injection attacks, where an attacker can embed malicious instructions within external content, compromising LLM output and causing responses to deviate from user expectations. To investigate this important but underexplored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks. Based on the evaluation, our work makes a key analysis of the underlying reason for the success of the attack, namely the inability of LLMs to distinguish between instructions and external content and the absence of LLMs' awareness to not execute instructions within external content. Building upon this analysis, we develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training accordingly. Experimental results demonstrate that black-box defenses are highly effective in mitigating these attacks, while the white-box defense reduces the attack success rate to near-zero levels. Overall, our work systematically investigates indirect prompt injection attacks by introducing a benchmark, analyzing the underlying reason for the success of the attack, and developing an initial set of defenses.
翻译:大型语言模型(LLM)与外部内容的集成使其应用范围更广、时效性更强(例如Microsoft Copilot)。然而,这种集成也使LLM面临间接提示注入攻击的风险——攻击者可将恶意指令嵌入外部内容中,从而破坏LLM的输出,导致响应偏离用户预期。为探究这一重要但尚未充分研究的问题,我们首次提出针对间接提示注入攻击的基准测试BIPIA,用于评估此类攻击的风险。基于评估结果,我们的工作对攻击成功的内在原因进行了关键性分析,即LLM无法区分指令与外部内容,且缺乏不执行外部内容中指令的认知能力。基于这一分析,我们分别开发了基于提示学习的两种黑盒防御方法,以及基于对抗训练微调的白盒防御方法。实验结果表明,黑盒防御在缓解此类攻击方面非常有效,而白盒防御可将攻击成功率降至接近零的水平。总体而言,我们的工作通过提出基准测试、分析攻击成功的根本原因并开发初步防御体系,系统性地研究了间接提示注入攻击。