LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing

Logs record runtime information and play an important role in modern software development. Log parsing is the first step in many log-based analyses and involves extracting structured information from unstructured log data. Traditional log parsers struggle to parse logs accurately because log formats are so diverse, which directly affects the performance of downstream log-analysis tasks. In this paper, we explore the potential of using Large Language Models (LLMs) for log parsing and propose LLMParser, a log parser based on generative LLMs and few-shot tuning. We use four LLMs in LLMParser: Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (96% average parsing accuracy). We further conduct a comprehensive empirical analysis of how training size, model size, and pre-training the LLM affect log parsing accuracy. We find that smaller LLMs may be more effective than more complex LLMs; for instance, Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained on logs from other systems does not always improve parsing accuracy: pre-trained Flan-T5-base shows an improvement in accuracy, whereas pre-trained LLaMA shows a decrease (almost 55% lower group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research directions of LLM-based log parsers.
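To make the task concrete, the following minimal sketch shows what "extracting structured information from unstructured log data" can look like once a template is known. The log line, the template, the `<*>` placeholder convention, and the helper names are illustrative assumptions and are not taken from the paper; LLMParser itself generates the template with a tuned LLM rather than receiving it as input.

```python
import re
from typing import List, Optional


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Turn a log template with <*> placeholders into a regular expression.

    The <*> placeholder syntax is an assumption made for this sketch.
    """
    # Escape the constant parts so they are matched literally,
    # then join them with a lazy capture group for each variable.
    constant_parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("(.+?)".join(constant_parts))


def extract_parameters(log_message: str, template: str) -> Optional[List[str]]:
    """Return the variable values of log_message, or None if it does not fit."""
    match = template_to_regex(template).fullmatch(log_message)
    return list(match.groups()) if match else None


# Hypothetical raw log message and the template a parser would output for it.
log_message = "Connection from 10.0.0.5 closed after 42 seconds"
template = "Connection from <*> closed after <*> seconds"

params = extract_parameters(log_message, template)
print(params)
```

Here the template carries the constant part of the message, and the extracted values are the dynamic parameters that downstream analyses would consume.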
