Logs are a primary source of information for engineers diagnosing failures in large-scale online service systems. Log parsing, which extracts structured events from massive volumes of unstructured log data, is a critical first step for downstream tasks such as anomaly detection and failure diagnosis. With advances in large language models (LLMs), leveraging their strong text understanding capabilities has proven effective for accurate log parsing. However, existing LLM-based log parsers all focus on the constant part of logs and ignore the potential contribution of the variable part to log parsing. This constant-centric strategy causes four key problems. First, log grouping and sampling are inefficient when they rely only on constant information. Second, constant-based caching results in a relatively large number of LLM invocations, which lowers parsing accuracy and efficiency. Third, prompts consume a relatively large number of constant tokens, leading to high LLM invocation costs. Finally, these methods retain only placeholders in their results, losing the system visibility that the variable information in logs provides. To address these problems, we propose a variable-centric log parsing strategy named VarParser. Through variable contribution sampling, a variable-centric parsing cache, and adaptive variable-aware in-context learning, our approach efficiently captures the variable parts of logs and leverages their contributions to parsing. By introducing variable units, we preserve rich variable information and improve the integrity of log parsing results. Extensive evaluations on large-scale datasets demonstrate that VarParser achieves higher accuracy than existing methods while significantly improving parsing efficiency and reducing LLM invocation costs.
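To make the distinction between the constant and variable parts of a log concrete, the following minimal sketch splits a log message into a template with placeholders and a list of extracted variable values. It is an illustration only and not VarParser's algorithm: the regular expression `VARIABLE_PATTERN`, the function `split_log`, the `<*>` placeholder, and the sample messages are all assumptions introduced here.

```python
import re
from typing import List, Tuple

# Hypothetical rule for this sketch: treat IPv4 addresses (with an optional
# port), hexadecimal identifiers, and plain integers as variables.
VARIABLE_PATTERN = re.compile(
    r"\b(?:\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?|0x[0-9a-fA-F]+|\d+)\b"
)


def split_log(message: str) -> Tuple[str, List[str]]:
    """Return (template, variables) for a single log message.

    The template keeps the constant text and replaces each variable with
    the placeholder "<*>"; the variables list keeps the original values
    in the order they appear.
    """
    variables = VARIABLE_PATTERN.findall(message)
    template = VARIABLE_PATTERN.sub("<*>", message)
    return template, variables


# Two messages that differ only in their variable parts.
first = split_log("Connection from 10.0.0.5:8080 closed after 42 ms")
second = split_log("Connection from 192.168.1.7:443 closed after 7 ms")
```

Both messages map to the template `Connection from <*> closed after <*> ms`. A parser that keeps only this template discards the addresses and durations, whereas the second element of each result retains them, which is the kind of information the abstract describes as lost when only placeholders are kept.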