Log parsing, which converts log messages into structured formats, is a crucial step for various log analysis tasks. Although numerous log parsers have been proposed, their effectiveness on complex log data is often limited by their reliance on hand-crafted rules or on learning-based models trained with limited data. The recent rise of powerful large language models (LLMs) shows potential for log parsing, owing to their extensive pre-trained knowledge of code and logging. However, their accuracy is currently limited because they lack specialized log parsing capabilities. In addition, the inconsistency of their answers and their significant overhead hinder the practical adoption of LLM-based log parsing. To tackle these challenges, we introduce LLMParser, the first practical LLM-based log parsing framework. LLMParser enables accurate and robust log parsing by leveraging the in-context learning (ICL) capability of the LLM, employing a hierarchical candidate sampling algorithm, and selecting high-quality demonstrations. LLMParser also includes a novel adaptive parsing cache component that stores and refines the templates generated by the LLM. This design mitigates the inefficiency of LLMs by rapidly matching incoming logs against previously parsed log templates. LLMParser also adaptively updates the templates in the parsing cache to keep parsed results consistent. Extensive evaluation on large-scale public datasets demonstrates that LLMParser surpasses the state-of-the-art methods. Furthermore, LLMParser significantly reduces the number of queries to LLMs, achieving efficiency comparable to the most efficient baseline, Drain.
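The idea of a parsing cache that matches logs against stored templates, falls back to the LLM only on a miss, and refines stored templates over time can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not LLMParser's actual implementation: the names `ParsingCache`, `matches`, `merge`, and `fake_llm` are hypothetical, logs are tokenized on whitespace, the cache is a flat list, the `<*>` wildcard marks a variable position, and the LLM is replaced by a stub that only abstracts purely numeric tokens.

```python
from typing import Callable, List, Optional

WILDCARD = "<*>"  # assumed placeholder for a variable token


def matches(template: List[str], tokens: List[str]) -> bool:
    """True if every template token is a wildcard or equals the log token."""
    return len(template) == len(tokens) and all(
        t == WILDCARD or t == tok for t, tok in zip(template, tokens)
    )


def merge(a: List[str], b: List[str], max_diff: int = 1) -> Optional[List[str]]:
    """Merge two same-length templates differing in at most max_diff positions.

    Differing positions become wildcards; returns None if the templates
    are too dissimilar to merge.
    """
    if len(a) != len(b):
        return None
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) > max_diff:
        return None
    return [WILDCARD if i in diff else x for i, x in enumerate(a)]


class ParsingCache:
    """Hypothetical template cache that queries the LLM only on a miss."""

    def __init__(self, query_llm: Callable[[str], str]):
        self.query_llm = query_llm          # maps a raw log to a template string
        self.templates: List[List[str]] = []
        self.llm_calls = 0                  # counts how often the LLM was queried

    def parse(self, log: str) -> str:
        tokens = log.split()
        # Fast path: reuse a cached template without querying the LLM.
        for template in self.templates:
            if matches(template, tokens):
                return " ".join(template)
        # Slow path: ask the LLM for a template.
        self.llm_calls += 1
        new = self.query_llm(log).split()
        # Update step: fold the new template into a near-identical cached one.
        for i, old in enumerate(self.templates):
            merged = merge(old, new)
            if merged is not None:
                self.templates[i] = merged
                return " ".join(merged)
        self.templates.append(new)
        return " ".join(new)


def fake_llm(log: str) -> str:
    """Stand-in for an LLM call: abstracts only purely numeric tokens."""
    return " ".join(WILDCARD if tok.isdigit() else tok for tok in log.split())


if __name__ == "__main__":
    cache = ParsingCache(fake_llm)
    for line in ["Connected to node 17", "Connected to node 42",
                 "User alice logged in", "User bob logged in"]:
        print(cache.parse(line))
    print("LLM calls:", cache.llm_calls)
```

In this sketch the second `Connected to node …` line is served from the cache, so the stub is queried three times for four logs. The two `User … logged in` lines show the update step: the stub fails to abstract the user name, and merging the two near-identical templates turns that position into a wildcard. A flat list scan and a one-token merge threshold are chosen here only for brevity.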