Fully supervised log anomaly detection methods carry the heavy burden of annotating massive amounts of unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods treat each keyword independently, disregarding both the correlations between keywords and the contextual relationships among log sequences. In this paper, we propose LogLG, a novel weakly supervised log anomaly detection framework that explores the semantic connections among keywords in log sequences. Specifically, we design an end-to-end iterative process in which the keywords of unlabeled logs are first extracted to construct a log-event graph. We then build a subgraph annotator to generate pseudo labels for the unlabeled log sequences. To improve annotation quality, we pre-train the subgraph annotator with a self-supervised task. A detection model is then trained on the generated pseudo labels. Conditioned on its classification results, we re-extract keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG for detecting anomalies in unlabeled log data and show that LogLG achieves state-of-the-art results among weakly supervised methods, with significant performance improvements over existing methods.
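The iterative process described in the abstract (extract keywords, build a log-event graph, generate pseudo labels, train a detector, re-extract keywords) can be illustrated with a toy sketch. This is not the LogLG implementation: every function name, the co-occurrence graph, the scoring rule with its 0.1 weight, the count-based detector, and the sample logs are assumptions made for illustration. The self-supervised pre-training step is omitted, and simple counting stands in for the learned subgraph annotator and detection model.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_event_graph(sequences):
    """Hypothetical log-event graph: edge weight = number of sequences
    in which two events co-occur."""
    graph = defaultdict(Counter)
    for seq in sequences:
        for a, b in combinations(sorted(set(seq)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def annotate(seq, keywords, graph):
    """Toy stand-in for the subgraph annotator: score each class by direct
    keyword hits plus weighted graph links between events and keywords."""
    events = set(seq)
    scores = {}
    for label in sorted(keywords):
        kws = keywords[label]
        direct = len(events & kws)
        linked = sum(graph[e][k] for e in events for k in kws if e != k)
        scores[label] = direct + 0.1 * linked  # 0.1 is an arbitrary weight
    return max(sorted(scores), key=scores.get)


def train_detector(sequences, labels):
    """Toy stand-in for the detection model: per-class event counts."""
    model = defaultdict(Counter)
    for seq, label in zip(sequences, labels):
        model[label].update(set(seq))
    return model


def predict(model, seq, classes):
    """Pick the class whose event counts best cover the sequence."""
    def score(label):
        total = sum(model[label].values()) or 1  # guard against empty class
        return sum(model[label][e] for e in set(seq)) / total
    return max(sorted(classes), key=score)


def extract_keywords(sequences, predictions, classes, top_k=2):
    """Re-extract keywords: events most specific to each predicted class."""
    counts = train_detector(sequences, predictions)
    keywords = {}
    for label in classes:
        def margin(event):
            others = sum(counts[c][event] for c in classes if c != label)
            return counts[label][event] - others
        ranked = sorted(counts[label], key=lambda e: (-margin(e), e))
        keywords[label] = set(ranked[:top_k])
    return keywords


def run_iterations(sequences, seed_keywords, iterations=2):
    # Simplification: this toy graph depends only on the sequences, so it is
    # built once; the abstract describes updating it on every iteration.
    graph = build_event_graph(sequences)
    keywords = {label: set(kws) for label, kws in seed_keywords.items()}
    classes = sorted(keywords)
    predictions = []
    for _ in range(iterations):
        pseudo = [annotate(seq, keywords, graph) for seq in sequences]
        model = train_detector(sequences, pseudo)
        predictions = [predict(model, seq, classes) for seq in sequences]
        keywords = extract_keywords(sequences, predictions, classes)
    return predictions, keywords


# Made-up log sequences (lists of event names) and seed keywords.
logs = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "timeout", "retry", "error"],
    ["open", "read", "close"],
    ["connect", "timeout", "error"],
]
seeds = {"normal": {"close"}, "anomalous": {"error"}}
predictions, keywords = run_iterations(logs, seeds)
print(predictions)
print(keywords)
```

Starting from a single seed keyword per class, the loop labels every sequence and then widens each keyword set from the detector's own predictions, which is the feedback step the abstract refers to as re-extracting keywords.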