Malicious agents threaten the reliability and decision-making of Multi-Agent Systems (MAS) powered by Large Language Models (LLMs). Existing defenses often fall short because they are reactive or rely on centralized architectures that can introduce single points of failure. To address these challenges, we propose SentinelNet, the first decentralized framework for proactively detecting and mitigating malicious behavior in multi-agent collaboration. SentinelNet equips each agent with a credit-based detector trained via contrastive learning on augmented adversarial debate trajectories. The detector lets each agent autonomously evaluate message credibility and dynamically rank its neighbors, using bottom-k elimination to suppress malicious communications. To overcome the scarcity of attack data, SentinelNet generates adversarial trajectories that simulate diverse threats, which supports robust training. Experiments on MAS benchmarks show that SentinelNet achieves near-perfect detection of malicious agents (close to 100%) within two debate rounds and recovers 95% of system accuracy from compromised baselines. With strong generalizability across domains and attack patterns, SentinelNet establishes a new paradigm for safeguarding collaborative MAS.
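To make the neighbor-ranking step concrete, the following is a minimal sketch of credit tracking with bottom-k elimination from the point of view of a single agent. It is an illustration under stated assumptions, not SentinelNet's implementation: the detector scores are hard-coded stand-ins for the output of the contrastively trained detector, the moving-average credit update is assumed, and the names `update_credits`, `bottom_k_eliminate`, and `momentum` are hypothetical.

```python
from typing import Dict, List


def update_credits(credits: Dict[str, float],
                   scores: Dict[str, float],
                   momentum: float = 0.5) -> Dict[str, float]:
    """Blend each neighbor's previous credit with this round's detector score.

    The moving-average rule is an assumption made for illustration only.
    Neighbors without a previous credit start from a neutral 0.5.
    """
    return {
        name: momentum * credits.get(name, 0.5) + (1.0 - momentum) * score
        for name, score in scores.items()
    }


def bottom_k_eliminate(credits: Dict[str, float], k: int) -> List[str]:
    """Rank neighbors by credit (highest first) and drop the k lowest."""
    ranked = sorted(credits, key=lambda name: credits[name], reverse=True)
    keep = max(len(ranked) - k, 0)
    return ranked[:keep]


# Hypothetical detector outputs in [0, 1] for one agent's three neighbors
# over two debate rounds; a real detector would produce these scores.
credits = {"agent_a": 0.5, "agent_b": 0.5, "agent_c": 0.5}
round_scores = [
    {"agent_a": 0.9, "agent_b": 0.8, "agent_c": 0.1},
    {"agent_a": 0.8, "agent_b": 0.9, "agent_c": 0.2},
]
for scores in round_scores:
    credits = update_credits(credits, scores)

# Keep only the neighbors that survive elimination of the lowest-credit one.
trusted = bottom_k_eliminate(credits, k=1)
print(trusted)
```

In this toy run, `agent_c` receives consistently low scores and is the neighbor dropped after two rounds. Because every agent keeps its own credit table, no central coordinator is involved, which mirrors the decentralized design described above.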