Large language models (LLMs) can respond to natural-language queries and have shown strong potential for applications in network operations (NetOps). Because of the large amount of commonsense knowledge they encode, LLMs achieve much better inference accuracy than traditional models and exhibit strong emergent abilities in generalization, reasoning, and code generation. These abilities could substantially advance automated and intelligent NetOps. However, how well LLMs perform across NetOps tasks remains under-explored. In this work, we systematically assess the capabilities, strengths, and limitations of selected LLMs in the NetOps domain. The evaluation uses a collection of 5,732 NetOps questions and covers 26 publicly available general-domain LLMs, including ChatGPT, LLaMA, and Falcon. We also fine-tune some of these LLMs on our collected NetOps corpus and evaluate the resulting models. The evaluation method follows widely adopted benchmarks for general-domain LLMs, combined with Chain-of-Thought prompting and Retrieval-Augmented Generation. The results show that only GPT-4 achieves accuracy high enough to be equivalent to passing the NetOps certification exam for humans, while all the other LLMs score much lower. Nevertheless, some open models such as LLaMA 2 still demonstrate significant potential. Furthermore, we evaluate the impact of factors such as model parameters, prompt engineering, and instruction fine-tuning. This work should be treated as an initial effort toward the systematic evaluation of LLMs in NetOps; a more rigorous study is required before production use. The evaluation code and dataset will be released to benefit future research.