Logs enable monitoring of infrastructure status and of the performance of associated applications. They are also invaluable for diagnosing the root causes of any problems that arise. Log Anomaly Detection (LAD) pipelines automate the detection of anomalies in logs, assisting site reliability engineers (SREs) in system diagnosis. Log patterns change over time, so the LAD model that defines the "normal" log activity profile must be updated. In this paper, we introduce a Bayes Factor-based drift detection method that identifies when human intervention is required to retrain and update the LAD model. We illustrate our method using sequences of log activity, both from unaltered data and from simulated activity with controlled levels of anomaly contamination, based on real collected log data.
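To make the idea of a Bayes factor drift test concrete, the following is a minimal sketch and not the method of the paper, whose model is not specified in the abstract. It assumes that log activity in a window is summarized as counts per log template, and that each window follows a multinomial distribution with a symmetric Dirichlet prior. The Bayes factor compares the hypothesis that the reference and current windows have different template distributions against the hypothesis that they share one. The function names, the prior value, and the threshold of 10 are all illustrative assumptions.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def log_marginal(counts, prior=1.0):
    """Log marginal likelihood of an ordered sequence with the given
    per-template counts under a symmetric Dirichlet(prior) prior
    (assumed model, not taken from the paper)."""
    posterior = [c + prior for c in counts]
    return log_beta(posterior) - log_beta([prior] * len(counts))


def log_bayes_factor(reference, current, prior=1.0):
    """Log Bayes factor of 'windows differ' against 'windows share one
    distribution'. Positive values favor drift."""
    pooled = [r + c for r, c in zip(reference, current)]
    separate = log_marginal(reference, prior) + log_marginal(current, prior)
    return separate - log_marginal(pooled, prior)


def drift_detected(reference, current, threshold=10.0, prior=1.0):
    """Flag drift when the Bayes factor exceeds a chosen threshold.
    The default of 10 is an illustrative choice, not a value from the paper."""
    return log_bayes_factor(reference, current, prior) > math.log(threshold)


if __name__ == "__main__":
    reference = [500, 300, 200]   # hypothetical template counts, training window
    stable = [50, 30, 20]         # same proportions as the reference
    shifted = [20, 30, 50]        # proportions have changed
    print(drift_detected(reference, stable))
    print(drift_detected(reference, shifted))
```

In a pipeline of this kind, a flagged window would be handed to an SRE to decide whether the change reflects a new normal that warrants retraining or a burst of anomalies that should not be absorbed into the model.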