Software log analysis can be laborious and time consuming, and both time and labeled data are usually scarce in industrial settings. This paper studies unsupervised, time-efficient methods for anomaly detection. We study two custom and two established models. The custom models are an OOV (Out-Of-Vocabulary) detector, which counts the terms in the test data that are not present in the training data, and the Rarity Model (RM), which calculates a rarity score for terms based on their infrequency. The established models are KMeans and Isolation Forest. The models are evaluated on four public datasets (BGL, Thunderbird, Hadoop, HDFS) with three representation techniques for the log messages (words, character trigrams, parsed events), using AUC-ROC as the evaluation metric. The results vary with the dataset and the representation technique, so we advise different configurations depending on the requirements. For speed, the OOV detector with word representation is optimal. For accuracy, the OOV detector combined with trigram representation yields the highest AUC-ROC (0.846). For unfiltered data, where training includes both normal and anomalous instances, the most effective combination is the Isolation Forest with event representation, achieving an AUC-ROC of 0.829.
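To make the two custom models concrete, the following is a minimal sketch of how an OOV count and a rarity score could be computed over word or character-trigram representations. It is an illustration under stated assumptions, not the authors' implementation: the function names (`words`, `char_trigrams`, `fit_counts`, `oov_score`, `rarity_score`) are hypothetical, counting every out-of-vocabulary occurrence rather than unique terms is one possible reading of the description, and the smoothed negative log frequency used for rarity is an assumed formula, since the abstract does not specify one.

```python
import math
from collections import Counter


def words(message):
    """Word representation: split a log message on whitespace."""
    return message.split()


def char_trigrams(message):
    """Character-trigram representation: every 3-character window."""
    return [message[i:i + 3] for i in range(len(message) - 2)]


def fit_counts(train_messages, tokenize):
    """Count how often each term occurs in the training messages."""
    counts = Counter()
    for message in train_messages:
        counts.update(tokenize(message))
    return counts


def oov_score(message, counts, tokenize):
    """Number of term occurrences in the message never seen in training.

    Assumption: repeated unseen terms are counted each time they occur.
    """
    return sum(1 for term in tokenize(message) if term not in counts)


def rarity_score(message, counts, tokenize):
    """Mean negative log of the smoothed training frequency of each term.

    Assumption: this formula is illustrative only; the abstract says the
    score is based on term infrequency but gives no definition.
    """
    terms = tokenize(message)
    if not terms:
        return 0.0
    total = sum(counts.values())
    vocab_size = len(counts)
    # Add-one smoothing keeps unseen terms finite and makes them the rarest.
    return sum(
        -math.log((counts[term] + 1) / (total + vocab_size + 1))
        for term in terms
    ) / len(terms)


# Hypothetical training and test messages, for illustration only.
train = [
    "connection established to node 1",
    "connection closed by node 1",
]
word_counts = fit_counts(train, words)
trigram_counts = fit_counts(train, char_trigrams)

normal = "connection closed by node 1"
unusual = "kernel panic on node 1"

print(oov_score(normal, word_counts, words))
print(oov_score(unusual, word_counts, words))
print(rarity_score(unusual, word_counts, words) > rarity_score(normal, word_counts, words))
```

Higher scores mark a message as more anomalous, so either score can be ranked against ground-truth labels to compute AUC-ROC. Neither model needs labels for training, which matches the unsupervised setting described above.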