GAL-MAD: Towards Explainable Anomaly Detection in Microservice Applications Using Graph Attention Networks

The transition to microservices has revolutionized software architectures, offering enhanced scalability and modularity. However, the distributed and dynamic nature of microservices makes system reliability harder to ensure, so anomaly detection is crucial for maintaining performance and functionality. Anomalies stemming from network and performance issues must be swiftly identified and addressed. Existing anomaly detection techniques often rely on statistical models or machine learning methods that struggle with the high-dimensional, interdependent data inherent in microservice applications. Moreover, current techniques and available datasets predominantly focus on system traces and logs, which limits their ability to support advanced detection models. This paper addresses these gaps by introducing the RS-Anomic dataset, generated using the open-source RobotShop microservice application. The dataset captures multivariate performance metrics and response times under normal and anomalous conditions, encompassing ten types of anomalies. We propose a novel anomaly detection model called Graph Attention and LSTM-based Microservice Anomaly Detection (GAL-MAD), which leverages Graph Attention and Long Short-Term Memory architectures to capture spatial and temporal dependencies in microservices. To enhance explainability, we use SHAP values to localize anomalous services and identify root causes. Experimental results demonstrate that GAL-MAD outperforms state-of-the-art models on the RS-Anomic dataset, achieving higher accuracy and recall across varying anomaly rates. The resulting explanations give system administrators actionable insights into service anomalies.
