Microservice-based systems (MSS) may experience failures across various fault categories due to their complex and dynamic nature. To handle such failures effectively, AIOps tools rely on trace-based anomaly detection and root cause analysis. In this paper, we propose a novel framework for few-shot abnormal trace classification for MSS. Our framework comprises two main components: (1) a Multi-Head Attention Autoencoder that constructs system-specific trace representations, which enables (2) Transformer Encoder-based Model-Agnostic Meta-Learning to perform effective and efficient few-shot learning for abnormal trace classification. We evaluate the proposed framework on two representative MSS, Trainticket and OnlineBoutique, using open datasets. The results show that our framework can adapt its learned knowledge to classify new, unseen abnormal traces of novel fault categories, both within the MSS it was initially trained on and in a different MSS. Within the same MSS, our framework achieves an average accuracy of 93.26\% and 85.2\% across 50 meta-testing tasks for Trainticket and OnlineBoutique, respectively, when provided with 10 instances for each task. In a cross-system context, our framework achieves an average accuracy of 92.19\% and 84.77\% on the same meta-testing tasks of the respective systems, also with 10 instances provided for each task. Our work demonstrates the feasibility of few-shot abnormal trace classification for MSS and shows how it can enable cross-system adaptability. This opens an avenue for building more generalized AIOps tools that require less system-specific data labeling for anomaly detection and root cause analysis.