MalTool: Malicious Tool Attacks on LLM Agents

In a malicious tool attack, an attacker uploads a malicious tool to a distribution platform; once a user installs the tool and the LLM agent selects it during task execution, the tool can compromise the user's security and privacy. Prior work primarily focuses on manipulating tool names and descriptions to increase the likelihood of installation by users and selection by LLM agents. However, a successful attack also requires embedding malicious behaviors in the tool's code implementation, which remains largely unexplored. In this work, we bridge this gap by presenting the first systematic study of malicious tool code implementations. We first propose a taxonomy of malicious tool behaviors based on the confidentiality-integrity-availability triad, tailored to LLM-agent settings. To investigate the severity of the risks posed by attackers exploiting coding LLMs to automatically generate malicious tools, we develop MalTool, a coding-LLM-based framework that synthesizes tools exhibiting specified malicious behaviors, either as standalone tools or embedded within otherwise benign implementations. To ensure functional correctness and structural diversity, MalTool leverages an automated verifier that validates whether generated tools exhibit the intended malicious behaviors and differ sufficiently from prior instances, iteratively refining generations until success. Our evaluation demonstrates that MalTool is highly effective even when coding LLMs are safety-aligned. Using MalTool, we construct two datasets of malicious tools: 1,200 standalone malicious tools and 5,287 real-world tools with embedded malicious behaviors. We further show that existing detection methods, including commercial malware detection approaches such as VirusTotal and methods tailored to the LLM-agent setting, exhibit limited effectiveness at detecting the malicious tools, highlighting an urgent need for new defenses.
