OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

Yuhe Liu, Changhua Pei, Longlong Xu, Bohan Chen, Mingze Sun, Zhirui Zhang, Yongqian Sun, Shenglin Zhang, Kun Wang, Haiming Zhang, Jianhui Li, Gaogang Xie, Xidao Wen, Xiaohui Nie, Minghua Ma, Dan Pei

Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential to keeping existing information systems running in an orderly and stable manner. According to Gartner's prediction, the use of AI technology for automated IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, show great potential in AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs on Ops tasks has yet to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability levels. The benchmark includes 7,184 multiple-choice questions and 1,736 question-answering (QA) questions in English and Chinese. Through a comprehensive performance evaluation of current leading LLMs, we show how various LLM techniques affect performance on Ops tasks, and we discuss findings on several topics, including model quantization, QA evaluation, and hallucination. To ensure the credibility of our evaluation, we invited dozens of domain experts to manually review our questions. We have also open-sourced 20% of the test QA data to help researchers run preliminary evaluations of their own OpsLLM models. The remaining 80% of the data is kept undisclosed to guard against test-set leakage. Additionally, we have built an online leaderboard that is updated in real time and will continue to be maintained, so that newly emerging LLMs are evaluated promptly. Both our dataset and leaderboard have been made public.
