There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety. Existing benchmarks do not address whether LLMs can successfully resolve errors in large-scale, interdependent network configurations without introducing new disruptions. Developing such a benchmark is challenging: scenarios must be diverse and increasingly complex, yet their evaluation must be straightforward and meaningful. In this paper, we present Cornetto, the first benchmark to evaluate LLM-driven network configuration repair functionally and at scale. Cornetto features a generation pipeline that synthesizes representative and plausible misconfiguration scenarios, coupled with an evaluation framework that uses formal verification to assess functional correctness of proposed fixes against ground-truth specifications. Using this pipeline, we synthesize a dataset of 231 problems for fixing configurations across varying network topologies (20--754 nodes) and diverse protocols. We evaluate 9 state-of-the-art LLMs and find that while they show promise, they often introduce regressions and their performance degrades at scale. Our results indicate that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.