As Large Language Model (LLM) agents become more capable, their coordinated use in the form of multi-agent systems is anticipated to emerge as a practical paradigm. Prior work has examined the safety and misuse risks associated with agents. However, much of this has focused on the single-agent case and/or setups missing basic engineering safeguards such as access control, revealing a scarcity of threat modeling in multi-agent systems. We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup, in which a central agent decomposes and delegates tasks to specialized agents. Through red-teaming a concrete setup representative of a likely future use case, we demonstrate a novel attack vector, OMNI-LEAK, that compromises several agents to leak sensitive data through a single indirect prompt injection, even in the presence of data access control. We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable, even when the attacker lacks insider knowledge of the implementation details. Our work highlights the importance of safety research to generalize from single-agent to multi-agent settings, in order to reduce the serious risks of real-world privacy breaches and financial losses and overall public trust in AI agents.
翻译:随着大型语言模型智能体能力的增强,其以多智能体系统形式进行的协同使用预计将成为一种实用范式。先前的研究已探讨了与智能体相关的安全与滥用风险。然而,这些研究大多集中于单智能体场景和/或缺乏基本工程防护措施(如访问控制)的设置,这表明针对多智能体系统的威胁建模尚显不足。我们研究了一种流行的多智能体模式——编排器设置——的安全漏洞,其中中心智能体负责分解任务并将其委派给专用智能体。通过对一个代表未来可能用例的具体设置进行红队测试,我们展示了一种新型攻击向量OMNI-LEAK,该攻击通过单次间接提示注入,在存在数据访问控制的情况下,仍能攻陷多个智能体以泄露敏感数据。我们报告了前沿模型对不同类别攻击的易感性,发现无论是推理模型还是非推理模型均存在漏洞,即使攻击者不具备对实现细节的内部了解。我们的工作强调了安全研究从单智能体向多智能体设置泛化的重要性,以降低现实世界中隐私泄露与财务损失的严重风险,并维护公众对AI智能体的整体信任。