Information Technology (IT) Operations (Ops), and in particular Artificial Intelligence for IT Operations (AIOps), underpin the orderly and stable operation of existing information systems. According to Gartner's forecast, using AI techniques to automate IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, are showing great potential in the field of AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs on Ops tasks remains to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability levels. The benchmark includes 7,184 multiple-choice questions and 1,736 question-answering (QA) questions in English and Chinese. Through a comprehensive performance evaluation of the current leading LLMs, we show how various LLM techniques affect performance on Ops tasks, and we discuss findings on topics including model quantization, QA evaluation, and hallucination. To ensure the credibility of our evaluation, we invited dozens of domain experts to manually review our questions. We have open-sourced 20% of the test QA pairs to help researchers conduct preliminary evaluations of their OpsLLM models; the remaining 80% is withheld to prevent test-set leakage. In addition, we have built an online leaderboard that is updated in real time and will continue to be maintained, ensuring that newly emerging LLMs are evaluated promptly. Both our dataset and leaderboard have been made public.