Few-Shot Cross-System Anomaly Trace Classification for Microservice-Based Systems

Microservice-based systems (MSS) are complex and dynamic, and they can fail in many different fault categories. To handle such failures, AIOps tools rely on trace-based anomaly detection and root cause analysis. In this paper, we propose a novel framework for few-shot abnormal trace classification in MSS. The framework comprises two main components: (1) a Multi-Head Attention Autoencoder that constructs system-specific trace representations, which enables (2) Transformer Encoder-based Model-Agnostic Meta-Learning to perform effective and efficient few-shot learning for abnormal trace classification. We evaluate the framework on two representative MSS, Trainticket and OnlineBoutique, using open datasets. The results show that the framework can adapt its learned knowledge to classify new, unseen abnormal traces of novel fault categories, both within the MSS on which it was trained and in a different MSS. Within the same MSS, the framework achieves an average accuracy of 93.26\% for Trainticket and 85.2\% for OnlineBoutique across 50 meta-testing tasks when given 10 instances per task. In the cross-system setting, it achieves an average accuracy of 92.19\% and 84.77\% on the same meta-testing tasks of the respective systems, again with 10 instances per task. Our work demonstrates the feasibility of few-shot abnormal trace classification for MSS and shows how it can enable cross-system adaptability. This opens an avenue for building more generalized AIOps tools that require less system-specific data labeling for anomaly detection and root cause analysis.
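
To make the evaluation protocol concrete, the following is a minimal sketch of how an average accuracy over a set of few-shot meta-testing tasks could be computed. It is not the framework itself: the nearest-centroid learner stands in for the meta-learned Transformer Encoder, and the toy two-dimensional vectors stand in for the autoencoder's trace representations. The function names, the 2-way task layout, the query-set size, and the reading of "10 instances" as 10 support instances per fault category are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_toy_traces(n_per_class=20, seed=0):
    """Hypothetical stand-in data: well-separated 2-D 'trace embeddings' per fault label."""
    rng = random.Random(seed)
    centers = {"fault_a": 0.0, "fault_b": 10.0, "fault_c": 20.0}
    return [
        ([c + rng.uniform(-1, 1), c + rng.uniform(-1, 1)], label)
        for label, c in centers.items()
        for _ in range(n_per_class)
    ]


def sample_task(traces, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task: a labeled support set and a disjoint query set."""
    by_label = defaultdict(list)
    for features, label in traces:
        by_label[label].append(features)
    classes = rng.sample(sorted(by_label), n_way)
    support, query = [], []
    for cls in classes:
        picked = rng.sample(by_label[cls], k_shot + n_query)
        support += [(x, cls) for x in picked[:k_shot]]
        query += [(x, cls) for x in picked[k_shot:]]
    return support, query


def fit_nearest_centroid(support):
    """Placeholder learner (an assumption, not the paper's model)."""
    grouped = defaultdict(list)
    for x, cls in support:
        grouped[cls].append(x)
    centroids = {
        cls: [sum(col) / len(col) for col in zip(*xs)] for cls, xs in grouped.items()
    }

    def predict(x):
        # Return the class whose centroid has the smallest squared distance to x.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def mean_task_accuracy(traces, n_tasks=50, n_way=2, k_shot=10, n_query=5, seed=0):
    """Adapt on each task's support set, score on its query set, average over tasks."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_tasks):
        support, query = sample_task(traces, n_way, k_shot, n_query, rng)
        predict = fit_nearest_centroid(support)
        correct = sum(predict(x) == y for x, y in query)
        accuracies.append(correct / len(query))
    return sum(accuracies) / len(accuracies)
```

Calling `mean_task_accuracy(make_toy_traces())` returns the accuracy averaged over 50 sampled tasks. In a cross-system setting, the same loop would be run on tasks drawn from a system other than the one used for meta-training.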
