The incorporation of LLMs in multi-agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks analyze each agent in isolation, overlooking failures that span the MAS as a whole. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, an anomaly detection layer, and an anomaly explanation layer. LumiMAS's first layer monitors MAS executions, creating detailed logs of the agents' activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, against a diverse set of possible failures. The applications include two novel failure-tailored applications that illustrate the effects of hallucinations or bias on the MAS. The evaluation results demonstrate LumiMAS's effectiveness in failure detection, classification, and RCA.