Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning

Microservice-based systems (MSS) may fail with various fault types. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human effort is still required to diagnose specific fault types and failure causes. This paper presents TraFaultDia, a novel AIOps framework that automatically classifies abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for an MSS. TraFaultDia leverages meta-learning to train on several abnormal trace classification tasks with a few labeled instances from one MSS, enabling quick adaptation to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. The use cases of TraFaultDia scale with how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two MSS, TrainTicket and OnlineBoutique, using open datasets in which each fault category is linked to faulty system components (service/pod) and a root cause. TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. When trained within the same MSS with 10 labeled instances per category, TraFaultDia achieves 93.26% and 85.20% accuracy on 50 new classification tasks for TrainTicket and OnlineBoutique, respectively. In the cross-system setting, where TraFaultDia is applied to an MSS different from the one it was trained on, it achieves average accuracies of 92.19% and 84.77% on the same sets of 50 new, unseen abnormal trace classification tasks for the respective systems, again with 10 labeled instances provided for each fault category per task.
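
To make the task formulation above concrete, the following is a minimal sketch of how one few-shot abnormal trace classification task could be assembled from labeled traces. It is an illustration under stated assumptions, not TraFaultDia's actual implementation: the function name `sample_task`, the `(trace_id, fault_category)` input format, the 5-way setting, and the query set size are all hypothetical. Only the figure of 10 labeled instances per fault category comes from the abstract.

```python
import random
from collections import defaultdict


def sample_task(traces, n_way, k_shot, n_query, rng):
    """Build one N-way K-shot classification task (hypothetical helper).

    traces: list of (trace_id, fault_category) pairs from a single MSS.
    Returns (support, query), each a list of (trace_id, task_local_label).
    """
    # Group abnormal traces by their fault category.
    by_category = defaultdict(list)
    for trace_id, category in traces:
        by_category[category].append(trace_id)

    # Only categories with enough traces for support + query are usable.
    eligible = sorted(
        c for c, ids in by_category.items() if len(ids) >= k_shot + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough fault categories with sufficient traces")

    # Pick the fault categories for this task, then split each category's
    # sampled traces into labeled support instances and held-out queries.
    support, query = [], []
    for label, category in enumerate(rng.sample(eligible, n_way)):
        picked = rng.sample(by_category[category], k_shot + n_query)
        support += [(t, label) for t in picked[:k_shot]]
        query += [(t, label) for t in picked[k_shot:]]
    return support, query


# Toy data: 6 made-up fault categories with 20 abnormal traces each.
toy_traces = [(f"trace-{c}-{i}", f"fault-{c}") for c in range(6) for i in range(20)]
support, query = sample_task(
    toy_traces, n_way=5, k_shot=10, n_query=5, rng=random.Random(0)
)
print(len(support), len(query))  # 50 25
```

In a meta-learning setup of this kind, many such tasks would be drawn from the training MSS, and the model would later adapt to a new task (from the same or a different MSS) using only its support set before being scored on its query set.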
