RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information

As the IT industry advances, system log data becomes increasingly crucial. Many computer systems rely on log texts for management because access to source code is restricted. The need for log anomaly detection is growing, especially in real-world applications, but identifying anomalies in rapidly accumulating logs remains challenging. Traditional deep learning-based anomaly detection models require dataset-specific training, which delays deployment. Moreover, most methods focus only on sequence-level log information, which makes subtle anomalies harder to detect, and they often involve inference processes that are difficult to use in real time. We introduce RAPID, a model that capitalizes on the inherent features of log data to enable anomaly detection without training delays, ensuring real-time capability. RAPID treats logs as natural language and extracts their representations with pre-trained language models. Given that logs can be categorized based on system context, we implement a retrieval-based technique that contrasts each test log with the most similar normal logs. This strategy not only removes the need for log-specific training but also incorporates token-level information, yielding refined and robust detection, particularly for unseen logs. We also propose the core set technique, which reduces the computational cost of the comparison. Experimental results show that, even without training on log data, RAPID performs competitively with prior models and achieves the best performance on certain datasets. Through various research questions, we verified its capability for real-time detection without delay.
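The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a mean-pooled token vector as the sequence-level representation, and the scoring rule (distance of the worst-matched token to the best-matching retrieved normal log) are all assumptions made for illustration, and toy one-hot vectors stand in for real pre-trained language model embeddings.

```python
import numpy as np

def unit(x):
    """L2-normalise along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_core_set(normal_logs):
    """Hypothetical core set: keep one copy of each distinct normal log."""
    seen, core = set(), []
    for tokens in normal_logs:
        key = (tokens.shape, np.round(tokens, 6).tobytes())
        if key not in seen:
            seen.add(key)
            core.append(tokens)
    return core

def sequence_embedding(tokens):
    """Assumed stand-in for a PLM sequence vector: mean of token embeddings."""
    return tokens.mean(axis=0)

def anomaly_score(test_tokens, core, k=2):
    """Retrieve the k most similar normal logs, then compare token by token."""
    # Sequence-level retrieval by cosine similarity.
    core_seq = unit(np.stack([sequence_embedding(t) for t in core]))
    test_seq = unit(sequence_embedding(test_tokens))
    nearest = np.argsort(-(core_seq @ test_seq))[:k]

    # Token-level comparison against each retrieved normal log.
    q = unit(test_tokens)
    scores = []
    for i in nearest:
        token_sims = q @ unit(core[i]).T   # shape: (n_test_tokens, n_normal_tokens)
        best = token_sims.max(axis=1)      # best match for each test token
        scores.append(float(1.0 - best.min()))  # distance of the worst-matched token
    return min(scores)                     # closest normal log decides the score

# Toy data: rows of the identity matrix play the role of token embeddings.
e = np.eye(4)
log_a = np.stack([e[0], e[1]])
log_b = np.stack([e[1], e[2]])
core = build_core_set([log_a, log_b, log_a.copy()])  # duplicate of log_a is dropped

score_normal = anomaly_score(np.stack([e[0], e[1]]), core)
score_anomalous = anomaly_score(np.stack([e[0], e[3]]), core)  # e[3] never seen
print(len(core), score_normal, score_anomalous)
```

In this sketch no parameters are fitted to the log data; the only preparation is embedding the normal logs and deduplicating them, which is the sense in which such an approach avoids a training delay. A single unfamiliar token is enough to raise the score, which a purely sequence-level comparison could average away.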


