Prompt and accurate detection of system anomalies is essential to ensuring the reliability of software systems. Unlike manual diagnosis, which exploits all available run-time information, existing approaches usually leverage only a single type of monitoring data (often logs or metrics) or fail to make effective use of the joint information among different types of data. Consequently, many false predictions occur. To better understand the manifestations of system anomalies, we conduct a systematic study on a large amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates that logs and metrics manifest system anomalies collaboratively and complementarily, and that neither alone is sufficient. Thus, integrating heterogeneous data can help recover the complete picture of a system's health status. In this context, we propose Hades, the first end-to-end semi-supervised approach to effectively identify system anomalies based on heterogeneous data. Our approach employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. It captures discriminative features and meaningful interactions from heterogeneous data via a cross-modal attention module, trained in a semi-supervised manner. We evaluate Hades extensively on large-scale simulated data and datasets from Huawei Cloud. The experimental results demonstrate the effectiveness of our model in detecting system anomalies. We also release the code and the annotated dataset for replication and future research.
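To make the idea of cross-modal attention between logs and metrics more concrete, the following is a minimal sketch and not the Hades implementation. It assumes that each modality has already been encoded into a sequence of fixed-size vectors, and it uses plain scaled dot-product attention without learned projections. The names `cross_attend`, `fuse`, `log_repr`, and `metric_repr`, as well as the array shapes, are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows.

    queries: (n_q, d) vectors from one modality
    context: (n_ctx, d) vectors from the other modality
    Returns the attended vectors (n_q, d) and the attention weights (n_q, n_ctx).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


def fuse(log_repr, metric_repr):
    """Build one global vector from both modalities (illustrative assumption).

    Log vectors attend over metric vectors and vice versa; each attended
    sequence is mean-pooled and the two pooled vectors are concatenated.
    """
    log_attended, _ = cross_attend(log_repr, metric_repr)
    metric_attended, _ = cross_attend(metric_repr, log_repr)
    return np.concatenate([log_attended.mean(axis=0), metric_attended.mean(axis=0)])


# Hypothetical inputs: 5 encoded log events and 12 encoded metric windows, dimension 8.
rng = np.random.default_rng(0)
log_repr = rng.normal(size=(5, 8))
metric_repr = rng.normal(size=(12, 8))

global_repr = fuse(log_repr, metric_repr)
_, attn = cross_attend(log_repr, metric_repr)
```

In a trained model, the queries, keys, and values would typically pass through learned linear projections, and the fused vector would feed a classifier; both are omitted here to keep the sketch short.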