Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but their accuracy often degrades when processing logs that deviate from the predefined rules. Recently, large language model (LLM)-based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and the limited context size of LLMs, and 3) privacy risks from sending sensitive log information to commercial models such as ChatGPT. To overcome these limitations, this paper introduces OpenLogParser, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. OpenLogParser first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval-augmented generation, which selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection, which iteratively queries the LLM to refine log templates and improve parsing accuracy; and iii) log template memory, which stores parsed templates to reduce LLM queries and improve parsing efficiency. Our evaluation on LogHub-2.0 shows that OpenLogParser achieves 25% higher parsing accuracy and processes logs 2.7 times faster than state-of-the-art LLM-based parsers. In short, OpenLogParser addresses the privacy and cost concerns of using commercial LLMs while achieving state-of-the-art parsing efficiency and accuracy.
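Two of the components above — Jaccard similarity-based selection of diverse logs and the log template memory — can be illustrated concretely. The following is a minimal sketch, not the paper's implementation: it assumes whitespace tokenization and a greedy farthest-first selection strategy, and uses a plain dictionary (`template_memory`, a hypothetical name) for the template cache.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two log lines."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_diverse(logs: list[str], k: int) -> list[str]:
    """Greedily pick up to k logs, each least similar to those already
    selected, so the LLM sees varied dynamic values against shared
    static text."""
    selected = [logs[0]]
    for _ in range(min(k, len(logs)) - 1):
        candidate = min(
            (l for l in logs if l not in selected),
            key=lambda l: max(jaccard(l, s) for s in selected),
            default=None,
        )
        if candidate is None:
            break
        selected.append(candidate)
    return selected


# Template memory: once a group's template is parsed, later logs in the
# same group skip the LLM entirely via a cache lookup.
template_memory: dict[str, str] = {}

logs = [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.2 port 22",
    "Connected to 192.168.1.5 port 8080",
]
# Picks the first log, then the one most different from it
# (the 192.168.1.5 line, which also differs in port).
print(select_diverse(logs, 2))
template_memory["connected-group"] = "Connected to <*> port <*>"
```

The greedy minimax choice here is one common way to realize "diverse" retrieval; the actual similarity scoring and grouping keys in OpenLogParser follow its fixed-depth grouping tree rather than this simplified dictionary key.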