An Empirical Study on Log-based Anomaly Detection Using Machine Learning

The growing complexity of systems increases the need for automated techniques dedicated to log analysis tasks such as Log-based Anomaly Detection (LAD). LAD has been widely addressed in the literature, mostly by means of deep learning techniques. This focus, however, has left traditional Machine Learning (ML) techniques with less attention, even though they may perform well in many cases, depending on the context and the datasets used. Further, ML techniques are mostly evaluated on their detection accuracy alone. However, this is not enough to decide whether a specific ML technique is suitable for the LAD problem. Other aspects to consider include the training and prediction time as well as the sensitivity to hyperparameter tuning. In this paper, we present a comprehensive empirical study in which we evaluate supervised and semi-supervised, traditional and deep ML techniques w.r.t. four evaluation criteria: detection accuracy, time performance, and the sensitivity of both detection accuracy and time performance to hyperparameter tuning. The experimental results show that supervised traditional and deep ML techniques perform very similarly in terms of detection accuracy and prediction time. Moreover, the overall evaluation of the sensitivity of detection accuracy to hyperparameter tuning shows that supervised traditional ML techniques are less sensitive than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques.

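The four evaluation criteria named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the `ThresholdDetector` class is a toy stand-in for an ML technique, the `margin` hyperparameter and the tiny dataset (error-event counts per log sequence) are hypothetical, and sensitivity is approximated here as the spread (maximum minus minimum) of a metric across hyperparameter settings, which may differ from the measure the authors use.

```python
import time
from statistics import mean


def f1_score(y_true, y_pred):
    """F1-score for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


class ThresholdDetector:
    """Hypothetical toy detector: flags a sequence as anomalous when its
    error-event count exceeds the mean count of normal training
    sequences plus `margin` (the only hyperparameter)."""

    def __init__(self, margin):
        self.margin = margin
        self.threshold = None

    def fit(self, X, y):
        normal = [x for x, label in zip(X, y) if label == 0]
        self.threshold = mean(normal) + self.margin
        return self

    def predict(self, X):
        return [1 if x > self.threshold else 0 for x in X]


def evaluate(make_model, settings, train, test):
    """Measure accuracy and time per setting, then the spread of each
    metric across settings as a simple sensitivity proxy (assumption)."""
    X_train, y_train = train
    X_test, y_test = test
    rows = []
    for setting in settings:
        model = make_model(setting)

        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_time = time.perf_counter() - start

        start = time.perf_counter()
        y_pred = model.predict(X_test)
        predict_time = time.perf_counter() - start

        rows.append({
            "setting": setting,
            "f1": f1_score(y_test, y_pred),
            "train_time": train_time,
            "predict_time": predict_time,
        })

    def spread(key):
        values = [row[key] for row in rows]
        return max(values) - min(values)

    return {
        "rows": rows,
        "f1_spread": spread("f1"),
        "train_time_spread": spread("train_time"),
        "predict_time_spread": spread("predict_time"),
    }


# Hypothetical data: one error-event count per log sequence, label 1 = anomalous.
train = ([0, 1, 1, 2, 0, 1, 9, 8], [0, 0, 0, 0, 0, 0, 1, 1])
test = ([0, 2, 1, 3, 7, 10, 4], [0, 0, 0, 0, 1, 1, 1])

result = evaluate(ThresholdDetector, [0.5, 2.0, 5.0], train, test)
for row in result["rows"]:
    print(row["setting"], round(row["f1"], 3))
print("F1 spread across settings:", round(result["f1_spread"], 3))
```

A small spread suggests that the technique's result depends little on the chosen setting. In a real comparison the toy detector would be replaced by the techniques under study, and the timings would be averaged over repeated runs, since single measurements at this scale are dominated by noise.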