Proactive detection of instance failures is essential to microservice systems because a single instance failure can propagate to the whole system and degrade its performance. Over the years, many anomaly detection methods based on single-modal data (i.e., metrics, logs, or traces) have been proposed. However, they tend to miss a large number of failures and generate numerous false alarms because they ignore the correlation among multimodal data. In this work, we propose AnoFusion, an unsupervised approach that proactively detects instance failures in microservice systems using multimodal data. It applies a Graph Transformer Network (GTN) to learn the correlation of the heterogeneous multimodal data and integrates a Graph Attention Network (GAT) with a Gated Recurrent Unit (GRU) to address the challenges introduced by dynamically changing multimodal data. We evaluate AnoFusion on two datasets, where it achieves F1-scores of 0.857 and 0.922, respectively, outperforming state-of-the-art failure detection approaches.
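The F1-scores reported above combine precision (how many raised alarms were real failures) and recall (how many real failures were caught), which matches the two weaknesses attributed to single-modal methods: false alarms and missed failures. The following is a minimal sketch of the standard F1 definition on binary per-timestamp labels. The function name `f1_from_labels` and the sample labels are illustrative assumptions, and the abstract does not state the exact evaluation protocol AnoFusion uses, so this should not be read as the authors' evaluation code.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the standard F1-score from binary labels (1 = failure, 0 = normal).

    Hypothetical helper for illustration; not taken from the AnoFusion paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures correctly detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false alarm, 1 missed failure
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_from_labels(y_true, y_pred)
```

Because F1 is a harmonic mean, a detector cannot score well by suppressing false alarms alone or by flagging everything: both precision and recall must be high.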