Large language models (LLMs) have exhibited remarkable capabilities in NLP tasks such as translation, summarization, and generation. Applying LLMs to specific domains, notably AIOps (Artificial Intelligence for IT Operations), holds great potential because of their strengths in information summarization, report analysis, and API calling. Nevertheless, the performance of current LLMs on AIOps tasks has yet to be determined, and a comprehensive benchmark is needed to guide the optimization of LLMs tailored for AIOps. In contrast to existing benchmarks, which evaluate narrow fields such as network configuration, this paper presents \textbf{OpsEval}, a comprehensive task-oriented AIOps benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in three crucial scenarios (Wired Network Operation, 5G Communication Operation, and Database Operation) at various ability levels (knowledge recall, analytical thinking, and practical application). The benchmark includes 7,200 questions in both multiple-choice and question-answer (QA) formats, available in English and Chinese. With quantitative and qualitative results, we show how various prompting techniques, including zero-shot, chain-of-thought, and few-shot in-context learning, affect LLM performance on AIOps tasks. We also find that GPT4-score is more consistent with expert judgments than the widely used BLEU and ROUGE metrics, suggesting that it can replace these automatic metrics in large-scale qualitative evaluations.