An Empirical Study of NetOps Capability of Pre-Trained Large Language Models

Large language models (LLMs) can respond to natural-language queries and show strong potential for network operations (NetOps). Thanks to the large amount of commonsense knowledge they encode, LLMs achieve much better inference accuracy than traditional models and exhibit strong generalization, reasoning, and code-generation abilities. These abilities could substantially advance automated and intelligent NetOps. However, how well LLMs perform on different NetOps tasks remains under-explored. In this work, we systematically assess the capabilities, strengths, and limitations of selected LLMs in NetOps. The evaluation uses a collection of 5,732 NetOps questions and covers 26 publicly available general-domain LLMs, including ChatGPT, LLaMA, and Falcon. We also fine-tune some of these LLMs on our collected NetOps corpus and evaluate the resulting models. The evaluation method follows widely adopted benchmarks for general-domain LLMs, combined with chain-of-thought prompting and retrieval-augmented generation. The results show that only GPT-4 achieves accuracy high enough to be equivalent to passing the NetOps certification exam for humans, while all other LLMs have much lower accuracy. Nevertheless, some open models such as LLaMA 2 still demonstrate significant potential. Furthermore, we evaluate the impact of factors such as model parameters, prompt engineering, and instruction fine-tuning. This work should be treated as an initial effort toward the systematic evaluation of LLMs in NetOps; a more rigorous study is required before production use. The evaluation code and dataset will be released to benefit future research.

