Fully supervised log anomaly detection methods suffer from the heavy burden of annotating massive amounts of unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods consider each keyword independently, disregarding both the correlations between keywords and the contextual relationships among log sequences. In this paper, we propose a novel weakly supervised log anomaly detection framework, named LogLG, that explores the semantic connections among keywords in log sequences. Specifically, we design an end-to-end iterative process in which the keywords of unlabeled logs are first extracted to construct a log-event graph. We then build a subgraph annotator to generate pseudo labels for unlabeled log sequences. To improve annotation quality, we adopt a self-supervised task to pre-train the subgraph annotator. After that, a detection model is trained with the generated pseudo labels. Conditioned on the classification results, we re-extract the keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG in detecting anomalies in unlabeled log data and demonstrate that LogLG, as the state-of-the-art weakly supervised method, achieves significant performance improvements over existing methods.
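As a rough illustration of the iterative process described in the abstract, the following Python sketch runs the loop on a toy example. Every function name, the toy log data, the seed keywords, and the scoring rules are assumptions made for illustration and are not LogLG's actual components. In particular, the subgraph annotator is replaced by a simple keyword-and-co-occurrence vote, the self-supervised pre-training step is omitted, and the detection model is a per-class token counter rather than a trained neural model.

```python
from collections import Counter
from itertools import combinations


def build_graph(sequences, keywords):
    """Count co-occurrences of token pairs that involve at least one keyword."""
    vocab = set().union(*keywords.values())
    graph = Counter()
    for seq in sequences:
        for a, b in combinations(sorted(set(seq)), 2):
            if a in vocab or b in vocab:
                graph[(a, b)] += 1
    return graph


def annotate(seq, keywords, graph):
    """Stand-in annotator: keyword hits plus weighted links to class keywords."""
    tokens = set(seq)
    scores = {}
    for label, words in keywords.items():
        hits = sum(1 for tok in tokens if tok in words)
        links = sum(
            graph[tuple(sorted((tok, kw)))]
            for tok in tokens
            for kw in words
            if tok != kw
        )
        scores[label] = hits + 0.1 * links
    return max(scores, key=scores.get)


def train_detector(sequences, pseudo_labels):
    """Stand-in detector: per-class token counts learned from pseudo labels."""
    model = {}
    for seq, label in zip(sequences, pseudo_labels):
        model.setdefault(label, Counter()).update(seq)
    return model


def predict(model, seq):
    """Pick the class whose token counts best cover the sequence."""
    return max(model, key=lambda label: sum(model[label][tok] for tok in seq))


def extract_keywords(sequences, labels, previous, top_k=2):
    """Re-extract, per class, the tokens most specific to that class."""
    per_class = {label: Counter() for label in previous}
    for seq, label in zip(sequences, labels):
        per_class[label].update(set(seq))
    keywords = {}
    for label, counts in per_class.items():
        if not counts:  # no sequence was assigned to this class: keep old keywords
            keywords[label] = set(previous[label])
            continue
        others = Counter()
        for other, other_counts in per_class.items():
            if other != label:
                others.update(other_counts)
        ranked = sorted(counts, key=lambda t: (others[t] - counts[t], t))
        keywords[label] = set(ranked[:top_k])
    return keywords


def run(sequences, seed_keywords, iterations=2):
    """Iterate: build graph, pseudo-label, train, classify, re-extract keywords."""
    keywords = {label: set(words) for label, words in seed_keywords.items()}
    predictions = []
    for _ in range(iterations):
        graph = build_graph(sequences, keywords)
        pseudo = [annotate(seq, keywords, graph) for seq in sequences]
        model = train_detector(sequences, pseudo)
        predictions = [predict(model, seq) for seq in sequences]
        keywords = extract_keywords(sequences, predictions, keywords)
    return predictions, keywords


# Hypothetical toy data: each sequence is a list of log-event tokens.
SEQUENCES = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "timeout", "retry", "fail"],
    ["connect", "timeout", "fail"],
    ["open", "read", "write", "close"],
]
SEEDS = {"normal": {"close"}, "anomalous": {"fail"}}

labels, keywords = run(SEQUENCES, SEEDS)
print(labels)
print(keywords)
```

In this sketch the keyword sets grow from one seed word per class to the two most class-specific tokens after the first pass, which is the only sense in which the graph is "updated" between iterations. A real implementation would replace each stand-in with a learned component.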