Large language models (LLMs) can respond to natural-language queries and have shown strong potential for applications in network operations (NetOps). Thanks to the large amount of commonsense knowledge they encode, LLMs achieve much better inference accuracy than traditional models and exhibit emergent abilities in generalization, reasoning, and code generation. These abilities could substantially advance automated and intelligent NetOps. However, how well LLMs perform on various NetOps tasks remains under-explored. In this work, we systematically assess the capabilities, strengths, and limitations of selected LLMs in the field of NetOps. The evaluation is conducted on a collection of 5,732 NetOps questions and covers 26 publicly available general-domain LLMs, including ChatGPT, LLaMA, and Falcon. We also fine-tune some of these LLMs on our collected NetOps corpus and evaluate the resulting models. The evaluation method follows widely adopted benchmarks for general-domain LLMs, combined with Chain-of-Thought prompting and Retrieval-Augmented Generation. The results show that only GPT-4 achieves accuracy high enough to be equivalent to passing the NetOps certification exam for humans, while all the other LLMs have much lower accuracy. Nevertheless, some open models such as LLaMA 2 still demonstrate significant potential. Furthermore, we evaluate the impact of factors such as the number of model parameters, prompt engineering, and instruction fine-tuning. This work should be treated as an initial effort toward the systematic evaluation of LLMs in NetOps; a more rigorous study is required for production use. The evaluation code and dataset will be released to benefit future research.