Try with Simpler -- An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection

The rapid growth of deep learning (DL) has spurred interest in improving log-based anomaly detection. DL-based approaches extract semantics from log events (log message templates) and feed them to advanced DL models that detect anomalies. However, because of their model complexity, these methods depend heavily on training data, labels, and computational resources. Traditional machine learning and data mining techniques, in contrast, are less data-dependent and more efficient, but less effective than DL. To make log-based anomaly detection more practical, the goal is to enhance traditional techniques so that they match DL's effectiveness. Previous research in a different domain (linking questions on Stack Overflow) suggests that optimized traditional techniques can rival state-of-the-art DL methods. Drawing inspiration from this idea, we conducted an empirical study. We optimized unsupervised PCA (Principal Component Analysis), a traditional technique, by incorporating a lightweight semantic-based log representation, which addresses the problem of log events that do not appear in the training data. Our study compared seven log-based anomaly detection methods (four DL-based methods, two traditional techniques, and the optimized PCA technique) on public and industrial datasets. The results indicate that the optimized unsupervised PCA technique achieves effectiveness similar to that of advanced supervised/semi-supervised DL methods, while being more stable with limited training data and more resource-efficient. This demonstrates the adaptability and strength of traditional techniques through small yet impactful adaptations.
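As an illustration of the idea in the abstract, here is a minimal sketch, not the authors' implementation, of PCA-based anomaly scoring over a semantic log representation. It assumes a toy embedding in which each word gets a deterministic pseudo-random vector (a stand-in for real pretrained word embeddings), an event template is the mean of its word vectors, and a log sequence is the mean of its event vectors. The names `word_vector`, `event_vector`, `sequence_vector`, `PCAScorer`, the dimension `DIM`, and the example templates are all hypothetical. The anomaly score is the squared distance from the subspace spanned by the top principal components of the normal training data.

```python
import hashlib

import numpy as np

DIM = 128  # toy embedding size (assumption for this sketch)


def word_vector(word):
    """Deterministic pseudo-random vector per word (stand-in for real word embeddings)."""
    seed = int.from_bytes(hashlib.sha256(word.encode("utf-8")).digest()[:8], "big")
    return np.random.default_rng(seed).standard_normal(DIM)


def event_vector(template):
    """Embed a log event template as the mean of its word vectors."""
    return np.mean([word_vector(w) for w in template.lower().split()], axis=0)


def sequence_vector(events):
    """Embed a log sequence as the mean of its event vectors."""
    return np.mean([event_vector(e) for e in events], axis=0)


class PCAScorer:
    """Scores a vector by its squared distance from the principal subspace."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # Rows of vt are principal directions, ordered by decreasing variance.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def score(self, X):
        centered = np.atleast_2d(X) - self.mean_
        in_subspace = centered @ self.components_.T @ self.components_
        return np.sum((centered - in_subspace) ** 2, axis=1)


# Hypothetical "normal" event templates and training sequences.
A = "receiving block from source to destination"
B = "received block of size from source"
C = "deleting block file from disk"
train = [[A, A, B], [A, B, B], [A, B, C], [B, C, C], [A, C, C], [A, A, C]]
X_train = np.array([sequence_vector(s) for s in train])

# The training vectors span a 2-D plane, so two components capture all of it.
scorer = PCAScorer(n_components=2).fit(X_train)

tests = np.array([
    sequence_vector([B, B, C]),  # only seen events, new mixture
    sequence_vector([A, B, "receiving block from source to mirror"]),  # unseen, close to A
    sequence_vector([A, B, "checksum exception detected"]),  # unseen, unrelated
])
s_seen, s_similar, s_different = scorer.score(tests)
print(s_seen, s_similar, s_different)
```

Because every event is embedded through its words, an event that never occurred in training still gets a vector. In this toy setup, an unseen template that shares most words with a seen one lands closer to the normal subspace than an unrelated template. Turning scores into anomaly labels requires a threshold, which this sketch deliberately leaves out.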

PCA

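In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional space by maximizing the variance along each dimension. Given a set of points in two, three, or more dimensions, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line can be chosen in the same way from the directions perpendicular to the first line. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components.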
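The properties described above can be checked numerically. The following is a minimal sketch using NumPy on synthetic 3-D points; the data, the `mixing` matrix, and the helper `mean_sq_distance_to_line` are assumptions made for illustration. It computes the principal components as eigenvectors of the covariance matrix, then verifies that they are orthonormal and that the projected coordinates are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 3-D points (illustrative data, not from any dataset).
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
points = rng.standard_normal((500, 3)) @ mixing.T

centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse to get decreasing variance.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]
components = eigvecs[:, ::-1].T  # each row is one principal component

# Coordinates of the data in the principal-component basis.
projected = centered @ components.T
projected_cov = np.cov(projected, rowvar=False)


def mean_sq_distance_to_line(direction):
    """Mean squared distance from the centered points to the line along `direction`."""
    direction = direction / np.linalg.norm(direction)
    along = centered @ direction
    return np.mean(np.sum(centered ** 2, axis=1) - along ** 2)


print("variances along components:", eigvals)
print("distance to first component:", mean_sq_distance_to_line(components[0]))
print("distance to x-axis:", mean_sq_distance_to_line(np.array([1.0, 0.0, 0.0])))
```

The covariance of `projected` is diagonal up to rounding error, which is what "uncorrelated dimensions" means here, and the first component gives a mean squared distance no larger than that of any other direction through the mean, such as the x-axis.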
