Large language model (LLM)-integrated applications have become increasingly prevalent, yet they face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks is difficult for two reasons: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries with the surrounding context, making them hard to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility without degradation.