Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely on a single data modality (e.g., metrics, logs, or traces alone) and struggle to address multiple downstream tasks at once, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, current techniques rarely expose clear reasoning evidence, which limits their interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework that fully leverages multimodal data. TrioXpert builds three independent data processing pipelines based on the inherent characteristics of the different modalities, characterizing the operational status of microservice systems along both numerical and textual dimensions. It employs a collaborative reasoning mechanism based on large language models (LLMs) to handle multiple tasks simultaneously while providing clear reasoning evidence for its conclusions. We conducted extensive evaluations on two microservice system datasets, and the experimental results show that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%). TrioXpert has also been deployed in Lenovo's production environment, demonstrating substantial gains in diagnostic efficiency and accuracy.