Efficiency of Unsupervised Anomaly Detection Methods on Software Logs

Software log analysis can be laborious and time-consuming, and industrial settings usually lack both time and labeled data. This paper studies unsupervised, time-efficient methods for anomaly detection. We study two custom and two established models. The custom models are an OOV (Out-Of-Vocabulary) detector, which counts the terms in the test data that are not present in the training data, and the Rarity Model (RM), which calculates a rarity score for terms based on their infrequency. The established models are KMeans and Isolation Forest. The models are evaluated on four public datasets (BGL, Thunderbird, Hadoop, HDFS) with three representation techniques for the log messages (words, character trigrams, parsed events), using AUC-ROC as the evaluation metric. The results vary with the dataset and the representation technique, so we recommend different configurations for different requirements. For speed, the OOV detector with word representation is the best choice. For accuracy, the OOV detector combined with trigram representation yields the highest AUC-ROC (0.846). For unfiltered data, where the training set includes both normal and anomalous instances, the most effective combination is the Isolation Forest with event representation, which achieves an AUC-ROC of 0.829.
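To make the two custom models more concrete, the following is a minimal Python sketch of how an OOV detector and a rarity score could work on word and character-trigram representations, together with a rank-based AUC-ROC computation. It is an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenizer, the toy log lines, and in particular the rarity formula (a smoothed negative log frequency summed over terms) are all hypothetical choices that the abstract does not specify.

```python
import math
from collections import Counter


def words(line):
    # Word representation: assumed here to be a plain whitespace split.
    return line.split()


def char_trigrams(line):
    # Character trigram representation: every 3-character window.
    return [line[i:i + 3] for i in range(len(line) - 2)]


def fit_vocabulary(train_lines, tokenize):
    # Count how often each term occurs in the training data.
    counts = Counter()
    for line in train_lines:
        counts.update(tokenize(line))
    return counts


def oov_score(line, vocab, tokenize):
    # OOV detector: number of terms in the line not seen during training.
    return sum(1 for term in tokenize(line) if term not in vocab)


def rarity_score(line, vocab, tokenize):
    # Hypothetical rarity formula (an assumption, not taken from the paper):
    # sum of smoothed negative log relative frequencies, so that rare and
    # unseen terms contribute the most.
    total = sum(vocab.values())
    return sum(
        -math.log((vocab.get(term, 0) + 1) / (total + 1))
        for term in tokenize(line)
    )


def roc_auc(labels, scores):
    # AUC-ROC as the probability that a random anomaly (label 1) scores
    # higher than a random normal line (label 0); ties count as one half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
train = [
    "connection established to node 1",
    "connection closed to node 1",
]
test = [
    "connection established to node 1",
    "kernel panic on node 1",
]
labels = [0, 1]  # 1 marks the anomalous line

word_vocab = fit_vocabulary(train, words)
trigram_vocab = fit_vocabulary(train, char_trigrams)

word_oov = [oov_score(line, word_vocab, words) for line in test]
trigram_oov = [oov_score(line, trigram_vocab, char_trigrams) for line in test]
word_rarity = [rarity_score(line, word_vocab, words) for line in test]

print("OOV (words):   ", word_oov, "AUC:", roc_auc(labels, word_oov))
print("OOV (trigrams):", trigram_oov, "AUC:", roc_auc(labels, trigram_oov))
print("Rarity (words):", word_rarity, "AUC:", roc_auc(labels, word_rarity))
```

Both scores need only a single counting pass over the training data and a lookup per term at test time, which is consistent with the emphasis on time efficiency. The parsed-event representation is omitted here because it depends on a log parser that the abstract does not name.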

