As the IT industry advances, system log data becomes increasingly crucial. Because access to source code is often restricted, many computer systems are managed through their log texts. Demand for log anomaly detection is growing, especially in real-world applications, but identifying anomalies in rapidly accumulating logs remains challenging. Traditional deep learning-based anomaly detection models require dataset-specific training, which introduces corresponding delays. Moreover, most methods focus only on sequence-level log information, which makes subtle anomalies harder to detect, and they often involve inference processes that are difficult to run in real time. We introduce RAPID, a model that capitalizes on the inherent features of log data to enable anomaly detection without training delays, ensuring real-time capability. RAPID treats logs as natural language and extracts representations using pre-trained language models. Given that logs can be categorized based on system context, we implement a retrieval-based technique that contrasts each test log with the most similar normal logs. This strategy not only obviates the need for log-specific training but also incorporates token-level information, ensuring refined and robust detection, particularly for unseen logs. We also propose the core set technique, which reduces the computational cost of comparison. Experimental results show that, even without training on log data, RAPID performs competitively with prior models and achieves the best performance on certain datasets. Through various research questions, we verified its capability for real-time detection without delay.
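The retrieval-based comparison described in the abstract can be illustrated with a minimal sketch. It is not RAPID's actual implementation: the abstract does not specify the pooling, the distance, or the scoring rule, so mean pooling, cosine similarity, and the "worst token" score below are all assumptions, and the function names (`pooled`, `token_distance`, `anomaly_score`) are hypothetical. Token embeddings are assumed to come from some pre-trained language model; random arrays stand in for them here. The core set technique is not modeled.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def pooled(tokens):
    """Sequence-level vector: mean of token embeddings (an assumption)."""
    return l2_normalize(tokens.mean(axis=0))

def token_distance(test_tokens, normal_tokens):
    """Match each test token to its most similar normal token and
    return the largest remaining cosine distance (assumed scoring rule)."""
    sims = l2_normalize(test_tokens) @ l2_normalize(normal_tokens).T
    return float((1.0 - sims.max(axis=1)).max())

def anomaly_score(test_tokens, normal_bank, k=1):
    """Retrieve the k most similar normal logs at the sequence level,
    then compare at the token level and keep the smallest distance."""
    query = pooled(test_tokens)
    bank_vecs = np.stack([pooled(t) for t in normal_bank])
    nearest = np.argsort(-(bank_vecs @ query))[:k]
    return min(token_distance(test_tokens, normal_bank[i]) for i in nearest)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder embeddings: 5 normal logs, 6 tokens each, 16 dimensions.
    bank = [rng.normal(size=(6, 16)) for _ in range(5)]
    test_log = rng.normal(size=(6, 16))
    print("anomaly score:", anomaly_score(test_log, bank, k=2))
```

Because the score only compares a test log against stored normal logs, nothing is trained on the log data itself; a threshold on the score would still have to be chosen separately.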