MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems

As LLM-based Multi-Agent Systems (MAS) are increasingly deployed for complex tasks, ensuring their reliability has become a pressing challenge. Since MAS coordinate through unstructured natural language rather than rigid protocols, they are prone to semantic failures (e.g., hallucinations, misinterpreted instructions, and reasoning drift) that propagate silently without raising runtime exceptions. Prevailing evaluation approaches, which measure only end-to-end task success, offer limited insight into how these failures arise or how effectively agents recover from them. To bridge this gap, we propose MAS-FIRE, a systematic framework for fault injection and reliability evaluation of MAS. We define a taxonomy of 15 fault types covering intra-agent cognitive errors and inter-agent coordination failures, and inject them via three non-invasive mechanisms: prompt modification, response rewriting, and message routing manipulation. Applying MAS-FIRE to three representative MAS architectures, we uncover a rich set of fault-tolerant behaviors that we organize into four tiers: mechanism, rule, prompt, and reasoning. This tiered view enables fine-grained diagnosis of where and why systems succeed or fail. Our findings reveal that stronger foundation models do not uniformly improve robustness. We further show that architectural topology plays an equally decisive role, with iterative, closed-loop designs neutralizing over 40% of faults that cause catastrophic collapse in linear workflows. MAS-FIRE provides the process-level observability and actionable guidance needed to systematically improve multi-agent systems.
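To make the three injection mechanisms concrete, the following is a minimal sketch of how non-invasive fault injection could wrap an agent without touching its internals. It is an illustration under stated assumptions, not the MAS-FIRE implementation: the `Agent` callable type, the `Message` dataclass, the `inject_*` helpers, and the toy `echo_agent` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical abstraction: an agent is any callable mapping a prompt to a response.
Agent = Callable[[str], str]


@dataclass
class Message:
    """Hypothetical inter-agent message envelope."""
    sender: str
    recipient: str
    content: str


def inject_prompt_fault(agent: Agent, corrupt: Callable[[str], str]) -> Agent:
    """Prompt modification: alter the prompt before the agent sees it."""
    def wrapped(prompt: str) -> str:
        return agent(corrupt(prompt))
    return wrapped


def inject_response_fault(agent: Agent, rewrite: Callable[[str], str]) -> Agent:
    """Response rewriting: alter the agent's output before other agents see it."""
    def wrapped(prompt: str) -> str:
        return rewrite(agent(prompt))
    return wrapped


def inject_routing_fault(message: Message, reroute: Dict[str, str]) -> Message:
    """Routing manipulation: deliver the message to a different recipient."""
    new_recipient = reroute.get(message.recipient, message.recipient)
    return Message(message.sender, new_recipient, message.content)


def echo_agent(prompt: str) -> str:
    """Toy stand-in for an LLM-backed agent (assumption, for illustration only)."""
    return f"answer to: {prompt}"


if __name__ == "__main__":
    # Simulate a misinterpreted instruction by corrupting the prompt.
    misread = inject_prompt_fault(echo_agent, lambda p: p.replace("sum", "product"))
    print(misread("compute the sum"))

    # Simulate a corrupted output by rewriting the response.
    garbled = inject_response_fault(echo_agent, lambda r: r.upper())
    print(garbled("x"))

    # Simulate a coordination failure by misrouting a message.
    msg = inject_routing_fault(Message("planner", "coder", "do it"), {"coder": "tester"})
    print(msg)
```

Each helper leaves the wrapped agent unchanged and intervenes only at its boundary, which is the sense in which such injection is non-invasive: the same fault can be applied to different architectures without modifying agent code.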
