While finetuning AI agents on interaction data -- such as web browsing or tool use -- improves their capabilities, it also introduces critical security vulnerabilities within the agentic AI supply chain. We show that adversaries can effectively poison the data collection pipeline at multiple stages to embed hard-to-detect backdoors that, when triggered, cause unsafe or malicious behavior. We formalize three realistic threat models across distinct layers of the supply chain: direct poisoning of finetuning data, pre-backdoored base models, and environment poisoning, a novel attack vector that exploits vulnerabilities specific to agentic training pipelines. Evaluated on two widely adopted agentic benchmarks, all three threat models prove effective: poisoning only a small number of demonstrations is sufficient to embed a backdoor that causes an agent to leak confidential user information with over 80\% success.
翻译:尽管在交互数据(如网络浏览或工具使用)上微调AI智能体可提升其能力,但这也为智能体AI供应链引入了关键安全漏洞。我们证明,攻击者可在多个阶段有效污染数据收集流水线,嵌入难以检测的后门——当这些后门被触发时,将导致不安全或恶意行为。我们形式化定义了供应链不同层级的三种现实威胁模型:对微调数据的直接投毒、预植入后门的基础模型,以及环境投毒——一种利用智能体训练流水线特有漏洞的新型攻击向量。在两种广泛采用的智能体基准测试中评估表明,所有三种威胁模型均有效:仅需投毒少量示范数据即可嵌入后门,导致智能体以超过80%的成功率泄露用户机密信息。