Logs produced by large-scale software systems are integral to monitoring system behavior. Advanced log analysis facilitates the detection, alerting, and diagnosis of system faults. Log parsing, which transforms raw log messages into structured templates, is a critical step in automating log analytics. Existing log parsers often fail to identify the correct templates because they rely on hand-crafted rules. Moreover, these methods focus on statistical features while ignoring the semantic information in log messages. To address these challenges, we introduce a \textbf{L}og parsing framework with \textbf{E}ntropy sampling and Chain-of-Thought \textbf{M}erging (Lemur). Specifically, to eliminate tedious manual rules, we propose a novel sampling method inspired by information entropy, which efficiently clusters typical logs. Furthermore, to improve the merging of log templates, we design a chain-of-thought method for large language models (LLMs), which exhibit strong semantic comprehension and can distinguish parameters from invariant tokens. Experiments on large-scale public datasets demonstrate that Lemur achieves state-of-the-art performance with high efficiency.
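
To make the notions of a log template and an entropy-based signal concrete, the following is a minimal illustrative sketch and not Lemur's actual algorithm. It assumes the logs in a group are whitespace-tokenized and have equal length, and it treats any token position whose Shannon entropy exceeds a threshold as a parameter. The function names `position_entropy` and `extract_template`, the `<*>` placeholder, and the threshold value are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def position_entropy(column):
    """Shannon entropy (in bits) of the tokens observed at one position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_template(logs, threshold=0.0, placeholder="<*>"):
    """Derive one template from a group of logs with the same token count.

    Illustrative assumption: a position whose entropy is above `threshold`
    varies across logs and is therefore treated as a parameter.
    """
    token_rows = [line.split() for line in logs]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("expected a non-empty group of equal-length logs")

    template = []
    for column in zip(*token_rows):  # iterate over token positions
        if position_entropy(column) > threshold:
            template.append(placeholder)  # varying position: parameter
        else:
            template.append(column[0])  # constant position: invariant token
    return " ".join(template)


if __name__ == "__main__":
    group = [
        "Connection from 10.0.0.1 closed",
        "Connection from 10.0.0.2 closed",
        "Connection from 10.0.0.7 closed",
    ]
    print(extract_template(group))  # prints: Connection from <*> closed
```

A purely statistical rule like this cannot tell a parameter that happens to be constant in the sample from a true invariant token, which is the kind of ambiguity the abstract attributes to ignoring semantic information.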