Few-Shot Cross-System Anomaly Trace Classification for Microservice-Based Systems

Microservice-based systems (MSS) may experience failures across various fault categories due to their complex and dynamic nature. To handle failures effectively, AIOps tools rely on trace-based anomaly detection and root cause analysis. In this paper, we propose a novel framework for few-shot abnormal trace classification for MSS. Our framework comprises two main components: (1) a Multi-Head Attention Autoencoder that constructs system-specific trace representations, which enables (2) Transformer Encoder-based Model-Agnostic Meta-Learning to perform effective and efficient few-shot learning for abnormal trace classification. We evaluate the proposed framework on two representative MSS, Trainticket and OnlineBoutique, using open datasets. The results show that our framework can adapt its learned knowledge to classify new, unseen abnormal traces of novel fault categories, both within the same system it was initially trained on and in a different MSS. Within the same MSS, our framework achieves an average accuracy of 93.26% and 85.2% across 50 meta-testing tasks for Trainticket and OnlineBoutique, respectively, when provided with 10 instances for each task. In a cross-system context, our framework achieves an average accuracy of 92.19% and 84.77% on the same meta-testing tasks of the respective systems, also with 10 instances provided for each task. Our work demonstrates the feasibility of few-shot abnormal trace classification for MSS and shows how it can enable cross-system adaptability. This opens an avenue for building more generalized AIOps tools that require less system-specific data labeling for anomaly detection and root cause analysis.
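The evaluation protocol summarized above (average accuracy over 50 meta-testing tasks, each with a handful of labeled instances) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `sample_task` and `mean_task_accuracy`, the 2-way 5-shot split (one possible reading of "10 instances for each task"), the query-set size, and the synthetic integer "traces" are all assumptions made for illustration. The `adapt_and_predict` callable is a hypothetical stand-in for whatever model is adapted on the support set.

```python
import random
from collections import defaultdict

def sample_task(labeled_traces, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task: a support set to adapt on, a query set to score."""
    by_label = defaultdict(list)
    for trace, label in labeled_traces:
        by_label[label].append(trace)
    # Only fault categories with enough traces can fill both sets.
    eligible = sorted(c for c, items in by_label.items()
                      if len(items) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_label[c], k_shot + n_query)  # no overlap
        support += [(t, c) for t in picked[:k_shot]]
        query += [(t, c) for t in picked[k_shot:]]
    return support, query

def mean_task_accuracy(tasks, adapt_and_predict):
    """Average query accuracy over tasks; the model sees only the support set."""
    accuracies = []
    for support, query in tasks:
        preds = adapt_and_predict(support, [t for t, _ in query])
        correct = sum(p == y for p, (_, y) in zip(preds, query))
        accuracies.append(correct / len(query))
    return sum(accuracies) / len(accuracies)

# Synthetic stand-in data (assumption): integers play the role of trace
# representations, and four made-up fault categories label them.
rng = random.Random(0)
data = [(i, f"fault_{i % 4}") for i in range(200)]
tasks = [sample_task(data, n_way=2, k_shot=5, n_query=5, rng=rng)
         for _ in range(50)]
```

Under these assumptions, each task exposes 10 labeled support instances, and the reported figure would correspond to `mean_task_accuracy(tasks, model)` for a model adapted separately on each task's support set.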