System logs generated in a computer system are large-scale data that are collected simultaneously and used as the basis for identifying errors, intrusions, and abnormal behaviors. The aim of system log anomaly detection is to promptly identify anomalies while minimizing human intervention, which is a critical problem in industry. Previous studies performed anomaly detection by first converting various forms of log data into standardized templates using a parser and then applying a detection algorithm. In particular, a template corresponding to each specific event must be defined in advance for all log data, and in this process information within the log key may be lost. In this study, we propose LAnoBERT, a parser-free system log anomaly detection method that uses the BERT model, which exhibits excellent natural language processing performance. LAnoBERT trains the model through masked language modeling, a BERT-based pre-training method, and performs unsupervised anomaly detection at test time using the masked language modeling loss computed for each log key. We also propose an efficient inference process so that the pipeline is practically applicable to real systems. Experiments on three well-known log datasets, HDFS, BGL, and Thunderbird, show that LAnoBERT not only achieves higher anomaly detection performance than unsupervised learning-based benchmark models but also achieves performance comparable to supervised learning-based benchmark models.
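The scoring idea described above (mask each token of a log line, measure how poorly the model predicts it, and treat a high loss as a sign of anomaly) can be illustrated with a minimal sketch. This is not the LAnoBERT implementation: `ToyMaskedModel` is a hypothetical count-based stand-in for a BERT masked language model, and averaging the per-token loss in `anomaly_score` is an assumption made for illustration, since the abstract does not specify how the losses are aggregated.

```python
import math
from collections import Counter

MASK = "[MASK]"


class ToyMaskedModel:
    """Hypothetical stand-in for a BERT masked language model.

    It predicts a masked token from its left and right neighbours,
    using counts gathered from normal logs with add-one smoothing.
    """

    def __init__(self, normal_logs):
        self.vocab = set()
        self.context_counts = {}  # (left, right) -> Counter of middle tokens
        for line in normal_logs:
            tokens = line.split()
            self.vocab.update(tokens)
            padded = ["<s>"] + tokens + ["</s>"]
            for i in range(1, len(padded) - 1):
                context = (padded[i - 1], padded[i + 1])
                self.context_counts.setdefault(context, Counter())[padded[i]] += 1

    def prob(self, masked_tokens, position, target):
        """Probability of `target` at the masked `position`."""
        padded = ["<s>"] + masked_tokens + ["</s>"]
        context = (padded[position], padded[position + 2])
        counts = self.context_counts.get(context, Counter())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        return (counts[target] + 1) / (sum(counts.values()) + vocab_size)


def anomaly_score(model, log_line):
    """Mean masked-token negative log-likelihood (aggregation is an assumption)."""
    tokens = log_line.split()
    if not tokens:
        return 0.0
    losses = []
    for i, target in enumerate(tokens):
        # Mask one token at a time and score the prediction of the original.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        losses.append(-math.log(model.prob(masked, i, target)))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Made-up log lines, used only to exercise the sketch.
    normal_logs = [
        "received block from node",
        "received block from node",
        "served block to node",
        "deleting block from disk",
    ]
    model = ToyMaskedModel(normal_logs)
    for line in ["received block from node", "exception while serving block"]:
        print(f"{anomaly_score(model, line):.3f}  {line}")
```

In this sketch the model is fitted only on normal logs, so a line whose tokens are hard to predict from context receives a higher score. A threshold on that score would then separate normal from anomalous lines; how to choose the threshold is left open here.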