Large language models (LLMs) have exhibited remarkable capabilities in NLP tasks such as translation, summarization, and generation. Applying LLMs to specific domains, notably AIOps (Artificial Intelligence for IT Operations), holds great potential because of their strengths in information summarization, report analysis, and API calling. Nevertheless, the performance of current LLMs on AIOps tasks has yet to be determined, and a comprehensive benchmark is needed to guide the optimization of LLMs tailored for AIOps. In this paper, we present \textbf{OpsEval}, a comprehensive task-oriented AIOps benchmark designed for LLMs. Unlike existing benchmarks, which focus on evaluating specific fields such as network configuration, OpsEval is the first to assess LLMs' proficiency in three crucial scenarios (Wired Network Operation, 5G Communication Operation, and Database Operation) at three ability levels (knowledge recall, analytical thinking, and practical application). The benchmark includes 7,200 questions in both multiple-choice and question-answer (QA) formats, available in English and Chinese. Using quantitative and qualitative results, we show how common LLM techniques, including zero-shot, chain-of-thought, and few-shot in-context learning, affect performance on AIOps tasks. We also find that GPT4-score is more consistent with expert judgments than the widely used BLEU and ROUGE metrics, and can therefore replace these automatic metrics in large-scale qualitative evaluations.