Prompt and accurate detection of system anomalies is essential to the reliability of software systems. Unlike manual efforts, which exploit all available run-time information, existing approaches usually leverage only a single type of monitoring data (often logs or metrics) or fail to make effective use of the joint information in multi-source data. Consequently, they produce many false predictions. To better understand how system anomalies manifest, we conduct a comprehensive empirical study on a large volume of heterogeneous data, i.e., logs and metrics. Our study shows that system anomalies can manifest differently in different data types. Integrating heterogeneous data can therefore help recover the complete picture of a system's health status. In this context, we propose HADES, the first work to effectively identify system anomalies based on heterogeneous data. Our approach employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. It captures discriminative features and meaningful interactions from multi-modal data via a novel cross-modal attention module, enabling accurate system anomaly detection. We evaluate HADES extensively on large-scale simulated and industrial datasets. The experimental results demonstrate the superiority of HADES in detecting system anomalies on heterogeneous data. We release the code and the annotated dataset for reproducibility and future research.
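To make the idea of cross-modal attention concrete, the following is a minimal sketch and not the HADES implementation. It assumes that log and metric windows have already been encoded into feature matrices of equal width, uses plain scaled dot-product attention without learned projections, and fuses the two modalities by mean pooling and concatenation. The names `cross_attention`, `fuse`, `log_feats`, and `metric_feats` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Let each row of `queries` attend over the rows of `context`.

    Assumption: both inputs share the same feature width, and no learned
    query/key/value projections are applied (a real module would have them).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # (n_queries, n_context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context, weights


def fuse(log_feats, metric_feats):
    """Build one global vector from both modalities (illustrative choice)."""
    log_view, _ = cross_attention(log_feats, metric_feats)
    metric_view, _ = cross_attention(metric_feats, log_feats)
    # Mean-pool each attended sequence, then concatenate the two summaries.
    return np.concatenate([log_view.mean(axis=0), metric_view.mean(axis=0)])


# Toy inputs: 5 log-event embeddings and 7 metric time steps, width 8.
rng = np.random.default_rng(0)
log_feats = rng.normal(size=(5, 8))
metric_feats = rng.normal(size=(7, 8))

global_repr = fuse(log_feats, metric_feats)
print(global_repr.shape)
```

In this sketch, each modality queries the other, so the fused vector reflects interactions between log events and metric time steps rather than either source alone. A downstream classifier would then consume `global_repr` to decide whether the window is anomalous.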