Log-based software reliability maintenance systems are crucial for sustaining a stable customer experience. However, existing deep learning-based methods are black boxes to service providers: providers cannot see how these methods detect anomalies, which undermines trust and hinders deployment in real production environments. To address this issue, this paper defines a trustworthiness metric, diagnostic faithfulness, that models should satisfy to gain service providers' trust; the metric is based on surveys of site reliability engineers (SREs) at a major cloud provider. We design two evaluation tasks for this metric: attention-based root cause localization and event perturbation. Empirical studies show that existing methods perform poorly in diagnostic faithfulness. We therefore propose FaithLog, a faithful log-based anomaly detection system that achieves faithfulness through a carefully designed causality-guided attention mechanism and adversarial consistency learning. Evaluation on two public datasets and one industrial dataset shows that FaithLog achieves state-of-the-art performance in diagnostic faithfulness.