A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly Detection

Growing system complexity increases the need for automated techniques for log analysis tasks such as Log-based Anomaly Detection (LAD). LAD has been widely addressed in the literature, mostly by means of a variety of deep learning techniques. Despite the many advantages of deep learning, this focus is somewhat arbitrary, since traditional Machine Learning (ML) techniques may perform well in many cases, depending on the context and datasets. In the same vein, semi-supervised techniques deserve the same attention as supervised techniques, since the former have clear practical advantages. Further, current evaluations rely mostly on detection accuracy. However, accuracy alone is not enough to decide whether a specific ML technique is suitable for the LAD problem in a given context. Other aspects to consider include training and prediction times as well as sensitivity to hyperparameter tuning, which matters to engineers in practice. In this paper, we present a comprehensive empirical study in which we evaluate supervised and semi-supervised, traditional and deep ML techniques w.r.t. four evaluation criteria: detection accuracy, time performance, sensitivity of detection accuracy to hyperparameter tuning, and sensitivity of time performance to hyperparameter tuning. The experimental results show that supervised traditional and deep ML techniques fare similarly in terms of detection accuracy and prediction time. Moreover, sensitivity analysis w.r.t. detection accuracy shows that, overall, supervised traditional ML techniques are less sensitive to hyperparameter tuning than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques.
