MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows in domains such as finance and software engineering. In these systems, safety and security are distributed across role-specialized agents, which significantly expands the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they select target agents heuristically and perturb isolated message streams, leaving critical questions unanswered, such as which agents are most responsible for system safety and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We introduce the first agent-level Shapley value analysis for MAS, which quantifies each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, which attributes failed attacks to the uncompromised agents that blocked them. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and customer relationship management (CRM). Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that prior single-agent or template-based methods overlook.
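To make the agent-level Shapley attribution concrete, the following is a minimal sketch of the classical exact Shapley computation applied to a set of agents. It is not MAStrike's implementation: the abstract does not specify the value function or any estimator, so the function `toy_attack_success`, the agent names, and the scores are hypothetical and chosen only to show how a pairwise interaction between two agents appears in the attribution.

```python
from itertools import combinations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley values by enumerating every coalition.

    `value` maps a frozenset of agents to a real number. The cost is
    exponential in the number of agents, so this only suits small systems.
    """
    n = len(agents)
    phi = {a: 0.0 for a in agents}
    for a in agents:
        others = [b for b in agents if b != a]
        for k in range(n):
            # Classical Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[a] += weight * (value(s | {a}) - value(s))
    return phi


def toy_attack_success(compromised):
    """Hypothetical value function (an assumption, not from the paper):
    attack success rate when the agents in `compromised` are compromised."""
    score = 0.0
    if "planner" in compromised:
        score += 0.2
    if "executor" in compromised:
        score += 0.1
    # Collusion bonus: the pair is far more effective than either agent alone.
    if {"planner", "executor"} <= compromised:
        score += 0.4
    return score


agents = ["planner", "executor", "reviewer"]
phi = shapley_values(agents, toy_attack_success)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.2f}")
```

In this toy setting the collusion bonus is split evenly between the two colluding agents, and the values sum to the success rate of the full coalition, which is the efficiency property of the Shapley value. A real system would need a measured value function and, for more than a handful of agents, a sampling-based approximation rather than full enumeration.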
