LogEval: A Comprehensive Benchmark Suite for Large Language Models in Log Analysis

Log analysis is crucial for keeping information systems running in an orderly and stable manner, particularly in the field of Artificial Intelligence for IT Operations (AIOps). Large Language Models (LLMs) have demonstrated significant potential in natural language processing tasks, and in the AIOps domain they perform well on tasks such as anomaly detection, root cause analysis of faults, operations and maintenance script generation, and alert summarization. However, the performance of current LLMs on log analysis tasks has not been adequately validated. To address this gap, we introduce LogEval, the first comprehensive benchmark suite designed to evaluate the capabilities of LLMs across a range of log analysis tasks. The benchmark covers log parsing, log anomaly detection, log fault diagnosis, and log summarization. LogEval evaluates each task using 4,000 publicly available log data entries and 15 different prompts per task to ensure a thorough and fair assessment, and it applies evaluation methods suited to each task so that performance is measured accurately. By rigorously evaluating leading LLMs, we show how LLM techniques such as self-consistency and few-shot in-context learning affect log analysis performance. We also discuss findings on model quantization, Chinese-English question-answering evaluation, and prompt engineering, which shed light on the strengths and weaknesses of LLMs in multilingual settings and on the effectiveness of different prompting strategies. Together, the results of LogEval's evaluation reveal the strengths and limitations of LLMs in log analysis tasks and provide guidance for researchers and practitioners.
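
As an illustration of the self-consistency idea mentioned above, the following is a minimal sketch, not LogEval's actual implementation. It assumes self-consistency is realized as a majority vote over several sampled answers, collected across multiple prompt templates. The function name `self_consistent_label`, the prompt wording, and the `fake_model` stand-in for a real LLM call are all hypothetical.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def self_consistent_label(
    log_line: str,
    prompt_templates: Sequence[str],
    ask_model: Callable[[str], str],
    samples_per_prompt: int = 3,
) -> str:
    """Return the most frequent normalized answer across all prompts and samples."""
    votes: Counter = Counter()
    for template in prompt_templates:
        for _ in range(samples_per_prompt):
            answer = ask_model(template.format(log=log_line))
            # Normalize case and whitespace so "Anomalous" and "anomalous " count together.
            votes[answer.strip().lower()] += 1
    label, _ = votes.most_common(1)[0]
    return label


# Hypothetical stand-in for a real LLM call: replays a fixed sequence of answers.
_canned = cycle(["Anomalous", "normal", "anomalous "])


def fake_model(prompt: str) -> str:
    return next(_canned)


# Hypothetical prompt templates; each must contain the {log} placeholder.
templates = [
    "Is this log line normal or anomalous? {log}",
    "Classify the following log entry as normal or anomalous: {log}",
]
label = self_consistent_label(
    "kernel: Out of memory: Kill process 1234", templates, fake_model
)
print(label)
```

With a real model, `ask_model` would wrap a sampled (non-greedy) generation call, so that repeated queries can disagree and the vote has something to resolve.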
