Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with external environments, which makes them susceptible to prompt injection when handling untrusted data. To address this vulnerability, we propose SIC (Soft Instruction Control), a simple yet effective iterative prompt-sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, it is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative, instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that an individual rewrite may fail, but it enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible: a strong adversary can still achieve a 15% attack success rate (ASR) by embedding non-imperative workflows. It nonetheless raises the bar substantially.
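The sanitization loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `detect_instructions` and `rewrite` are hypothetical stand-ins for the LLM-based inspector and sanitizer, reduced here to toy keyword matching.

```python
import re

# Hypothetical stand-in for the LLM-based inspector: flags
# imperative, instruction-like spans in untrusted data.
def detect_instructions(text: str) -> list[str]:
    triggers = ["ignore previous", "you must", "execute"]
    return [t for t in triggers if t in text.lower()]

# Hypothetical stand-in for the sanitizer: masks flagged spans.
def rewrite(text: str, findings: list[str]) -> str:
    for f in findings:
        text = re.sub(re.escape(f), "[REDACTED]", text, flags=re.IGNORECASE)
    return text

def sic_sanitize(data: str, max_iters: int = 3) -> str:
    """Iteratively inspect and rewrite untrusted input; halt if still dirty."""
    for _ in range(max_iters):
        findings = detect_instructions(data)
        if not findings:
            return data  # clean: safe to hand to the agent
        data = rewrite(data, findings)
    # Re-check after the final pass; halt rather than proceed unsafely.
    if detect_instructions(data):
        raise RuntimeError("SIC halt: input could not be sanitized")
    return data
```

The loop structure is the point: a single rewrite pass may miss an injection, so the output is re-inspected each round, and the agent refuses to proceed if the iteration budget is exhausted while suspicious content remains.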