TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

With the rapid development of LLM-based multi-agent systems (MAS), their significant safety and security concerns have emerged, which introduce novel risks going beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three-tier fine-grained risk taxonomy that identifies 20 risk types, covering single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structures, an evaluation layer containing risk-specific test modules, alongside runtime monitor agents coordinated by a unified LLM Judge Factory. During Evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real-time alerts, enabling both pre-development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.

翻译：随着基于大语言模型的多智能体系统的快速发展，其显著的安全与安保问题日益凸显，这些风险超越了单智能体或大语言模型所面临的挑战。尽管已有研究尝试解决这些问题，但现有文献仍缺乏专门针对多智能体系统风险的统一保障体系。本文提出TrinityGuard——一个基于OWASP标准的、针对基于大语言模型的多智能体系统的综合性安全评估与监控框架。具体而言，TrinityGuard构建了一个包含20种风险类型的三层细粒度风险分类体系，涵盖单智能体漏洞、智能体间通信威胁以及系统级涌现性危害。该框架采用三位一体架构设计，具备跨多种多智能体系统结构与平台的可扩展性：包括可适配任意多智能体系统结构的抽象层、包含风险专项测试模块的评估层，以及由统一的大语言模型裁判工厂协调的运行时监控智能体。在评估阶段，TrinityGuard执行精心设计的攻击探针，为每类风险生成详细漏洞报告；监控智能体则通过分析结构化执行轨迹实时发布警报，实现开发前评估与运行时监控的双重保障。我们进一步形式化了这些安全度量指标，并在多个典型多智能体系统示例中展开详细案例研究，展示了TrinityGuard的通用性与可靠性。总体而言，TrinityGuard作为评估与监控多智能体系统各类风险的综合性框架，为相关安全与安保研究的深入探索铺平了道路。