Logs produced by large-scale software systems are integral to monitoring system behavior. Advanced log analysis facilitates the detection, alerting, and diagnosis of system faults. Log parsing, which transforms raw log messages into structured templates, is a critical step in automating log analytics. Existing log parsers fail to identify the correct templates because they rely on human-made rules. Moreover, these methods focus on statistical features while ignoring the semantic information in log messages. To address these challenges, we introduce a \textbf{L}og parsing framework with \textbf{E}ntropy sampling and Chain-of-Thought \textbf{M}erging (Lemur). Specifically, to discard tedious manual rules, we propose a novel sampling method inspired by information entropy, which efficiently clusters typical logs. Furthermore, to enhance the merging of log templates, we design a chain-of-thought method for large language models (LLMs), which exhibit strong semantic comprehension and can distinguish parameters from invariant tokens. We have conducted experiments on large-scale public datasets. Extensive evaluation demonstrates that Lemur achieves state-of-the-art performance and impressive efficiency.
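To make the role of information entropy in log parsing concrete, the following is a minimal illustrative sketch and not Lemur's actual sampling or merging procedure. It assumes whitespace tokenization and a group of logs with equal token counts, and it treats any token position with nonzero Shannon entropy as a parameter. The function names (`shannon_entropy`, `position_entropies`, `naive_template`) and the `threshold` parameter are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def position_entropies(logs):
    """Per-position token entropy for whitespace-tokenized logs of equal length."""
    tokenized = [line.split() for line in logs]
    if len({len(tokens) for tokens in tokenized}) != 1:
        raise ValueError("all logs must have the same token count")
    # zip(*tokenized) yields one column of tokens per position.
    return [shannon_entropy(column) for column in zip(*tokenized)]


def naive_template(logs, threshold=0.0):
    """Replace positions whose entropy exceeds `threshold` with a wildcard.

    Assumption for illustration only: low-entropy positions are invariant
    tokens and high-entropy positions are parameters.
    """
    entropies = position_entropies(logs)
    first = logs[0].split()
    return " ".join(
        token if h <= threshold else "<*>" for token, h in zip(first, entropies)
    )


logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.3 closed",
]
print(naive_template(logs))  # prints: Connection from <*> closed
```

A purely statistical rule like this cannot tell a parameter that happens to be constant in the sample from a true invariant token, which is the kind of ambiguity the abstract attributes to methods that ignore semantic information.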