TrioXpert: An Automated Incident Management Framework for Microservice System

Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, and traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to simultaneously handle multiple tasks while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks. TrioXpert has also been deployed in Lenovo's production environment, demonstrating substantial gains in diagnostic efficiency and accuracy.
