The rapid growth of deep learning (DL) has spurred interest in improving log-based anomaly detection. DL-based approaches extract semantics from log events (log message templates) and build advanced models on top of them. However, they rely heavily on training data, labels, and computational resources because of their model complexity. Traditional machine learning and data mining techniques, in contrast, are less data-dependent and more efficient, but less effective than DL. To make log-based anomaly detection more practical, we aim to enhance traditional techniques so that they match the effectiveness of DL. Prior research in a different domain (linking questions on Stack Overflow) suggests that optimized traditional techniques can rival state-of-the-art DL methods. Inspired by this finding, we conducted an empirical study. We optimized unsupervised PCA (Principal Component Analysis), a traditional technique, by incorporating a lightweight semantic-based log representation, which addresses the problem of log events that do not appear in the training data. We compared seven log-based anomaly detection methods (four DL-based methods, two traditional techniques, and the optimized PCA technique) on public and industrial datasets. The results indicate that the optimized unsupervised PCA technique achieves effectiveness similar to that of advanced supervised and semi-supervised DL methods, while being more stable with limited training data and more resource-efficient. This demonstrates the adaptability and strength of traditional techniques through small yet impactful adaptations.
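To make the idea of PCA over a semantic log representation concrete, the following is a minimal sketch and not the study's implementation. It assumes that each log event is represented as the mean of its word vectors, that a log sequence is the mean of its event vectors, and that the anomaly score is the squared reconstruction error outside the PCA subspace. The hash-seeded `word_vector` is a stand-in for real pretrained word embeddings, and all function names, the embedding size, and the log templates are hypothetical. PCA is computed with a simple power iteration so that the example needs only the standard library.

```python
import hashlib
import math
import random

DIM = 16  # hypothetical embedding size for this sketch


def word_vector(word):
    """Hash-seeded pseudo-embedding; a stand-in for pretrained word vectors."""
    seed = int.from_bytes(hashlib.sha256(word.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def event_vector(template):
    """A log event is the mean of its word vectors; placeholders like <*> are skipped."""
    words = [w for w in template.lower().split() if w.isalpha()] or [template]
    return mean_vector([word_vector(w) for w in words])


def sequence_vector(events):
    """A log sequence is the mean of its event vectors."""
    return mean_vector([event_vector(e) for e in events])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(v, basis):
    """Remove the components of v along basis, then scale to unit length."""
    for u in basis:
        c = dot(v, u)
        v = [vi - c * ui for vi, ui in zip(v, u)]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v] if norm > 1e-12 else None


def fit_pca(rows, k, iters=100):
    """Return the mean and top-k principal directions (power iteration with deflation)."""
    mean = mean_vector(rows)
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]
    rng = random.Random(0)
    components = []
    for _ in range(k):
        v = orthonormalize([rng.gauss(0.0, 1.0) for _ in range(len(mean))], components)
        for _ in range(iters):
            # Multiply v by the unnormalized covariance: sum over rows of r * (r . v).
            w = [0.0] * len(mean)
            for r in centered:
                c = dot(r, v)
                w = [wi + c * ri for wi, ri in zip(w, r)]
            w = orthonormalize(w, components)
            if w is None:  # no variance left outside the directions already found
                break
            v = w
        components.append(v)
    return mean, components


def anomaly_score(x, mean, components):
    """Squared prediction error: squared norm of x's part outside the PCA subspace."""
    r = [xi - mi for xi, mi in zip(x, mean)]
    for u in components:
        c = dot(r, u)
        r = [ri - c * ui for ri, ui in zip(r, u)]
    return dot(r, r)


# Hypothetical HDFS-style templates, used only for illustration.
A = "Receiving block <*> from <*>"
B = "PacketResponder <*> for block <*> terminating"
UNSEEN = "Exception writing block <*> to mirror <*>"

train = [sequence_vector([A] * i + [B] * (6 - i)) for i in range(1, 6)]
mean, components = fit_pca(train, k=1)

normal_score = anomaly_score(sequence_vector([A, A, B]), mean, components)
unseen_score = anomaly_score(sequence_vector([A, B, UNSEEN]), mean, components)
print(normal_score, unseen_score)
```

In this toy setup the training sequences vary along a single direction, so one principal component reconstructs them almost exactly, while a sequence containing the unseen event should leave a noticeably larger residual. Because the unseen event is built from word vectors, it still receives a representation, which a count-based event vector could not provide. Turning the score into a detector requires a threshold, for example one derived from the training scores; that choice is left open here.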