To standardize interactions between LLM-based agents and their environments, the Model Context Protocol (MCP) was proposed and has since been widely adopted. However, integrating external tools expands the attack surface, exposing agents to tool poisoning attacks. In such attacks, malicious instructions embedded in tool metadata are injected into the agent context during the MCP registration phase, thereby manipulating agent behavior. Prior work primarily focuses on explicit tool poisoning or relies on manually crafted poisoned tools. In contrast, we focus on a particularly stealthy variant: implicit tool poisoning, where the poisoned tool itself is never invoked. Instead, the instructions embedded in its metadata induce the agent to invoke a legitimate but high-privilege tool to perform malicious operations. We propose MCP-ITP, the first automated and adaptive framework for implicit tool poisoning within the MCP ecosystem. MCP-ITP formulates poisoned tool generation as a black-box optimization problem and employs an iterative optimization strategy that leverages feedback from both an evaluation LLM and a detection LLM to maximize the Attack Success Rate (ASR) while evading current detection mechanisms. Experimental results on the MCPTox dataset across 12 LLM agents demonstrate that MCP-ITP consistently outperforms the manually crafted baseline, achieving up to 84.2% ASR while suppressing the Malicious Tool Detection Rate (MDR) to as low as 0.3%.
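As an illustration only, the two reported metrics can be read as simple proportions over evaluation trials. The sketch below assumes that ASR is the share of trials in which the agent carried out the targeted operation, and that MDR is the share of trials in which the detector flagged the poisoned tool. The `Trial` record, its field names, and the functions `rate`, `asr`, and `mdr` are hypothetical and are not taken from the MCP-ITP implementation; the precise definitions used in the experiments may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation record (hypothetical schema, assumed for illustration)."""
    attack_succeeded: bool     # agent performed the targeted operation
    flagged_by_detector: bool  # detector labeled the poisoned tool as malicious


def rate(flags: list[bool]) -> float:
    """Return the percentage of True values, or 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def asr(trials: list[Trial]) -> float:
    """Attack Success Rate in percent (assumed definition)."""
    return rate([t.attack_succeeded for t in trials])


def mdr(trials: list[Trial]) -> float:
    """Malicious Tool Detection Rate in percent (assumed definition)."""
    return rate([t.flagged_by_detector for t in trials])


# Toy data, not experimental results.
trials = [
    Trial(attack_succeeded=True, flagged_by_detector=False),
    Trial(attack_succeeded=True, flagged_by_detector=False),
    Trial(attack_succeeded=False, flagged_by_detector=True),
    Trial(attack_succeeded=True, flagged_by_detector=False),
]
print(f"ASR = {asr(trials):.1f}%, MDR = {mdr(trials):.1f}%")
```

Under these assumed definitions, a higher ASR together with a lower MDR corresponds to an attack that is both more effective and harder to detect, which is the trade-off the optimization targets.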