The integration of large language models (LLMs) with external content has enabled more up-to-date and wide-ranging applications of LLMs, such as Microsoft Copilot. However, this integration has also exposed LLMs to the risk of indirect prompt injection attacks, where an attacker can embed malicious instructions within external content, compromising LLM output and causing responses to deviate from user expectations. To investigate this important but underexplored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks. Based on the evaluation, our work makes a key analysis of the underlying reason for the success of the attack, namely the inability of LLMs to distinguish between instructions and external content and the absence of LLMs' awareness to not execute instructions within external content. Building upon this analysis, we develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training accordingly. Experimental results demonstrate that black-box defenses are highly effective in mitigating these attacks, while the white-box defense reduces the attack success rate to near-zero levels. Overall, our work systematically investigates indirect prompt injection attacks by introducing a benchmark, analyzing the underlying reason for the success of the attack, and developing an initial set of defenses.
翻译:大型语言模型(LLMs)与外部内容的集成使其应用(如Microsoft Copilot)更加广泛且实时更新。然而,这种集成也使LLMs面临间接提示注入攻击的风险——攻击者可将恶意指令嵌入外部内容,从而篡改LLM输出,导致模型响应偏离用户预期。为系统探究这一重要但尚未充分研究的问题,我们首次构建了间接提示注入攻击基准BIPIA(Benchmark for Indirect Prompt Injection Attacks),用于评估此类攻击风险。基于评估结果,本研究关键分析了攻击成功的根本原因:LLMs无法区分指令与外部内容,且缺乏不执行外部内容中指令的认知能力。基于该分析,我们分别开发了两种基于提示学习的黑盒防御方法和一种基于对抗训练微调的白盒防御方法。实验结果表明,黑盒防御能高效缓解此类攻击,而白盒防御可将攻击成功率降至接近零水平。总体而言,本研究通过构建基准、分析攻击成功根本原因及开发初始防御体系,系统性探究了间接提示注入攻击问题。