AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench~5G, the first benchmark designed to investigate whether AI coding agents can resolve real-world bugs in 5G core network software. The benchmark collects task instances from three open-source 5G projects, packages each as a self-contained Docker environment with automated fail-to-pass tests, and provides a dual test strategy tailored to the complex runtime dependencies of telecom code. In addition, for instances whose original issues reference 3GPP specification clauses, we construct concise specification context documents, enabling controlled evaluation of whether domain knowledge improves agent performance. Experiments on four LLMs reveal that all models diagnose bugs at rates exceeding 91\%, yet resolve rates remain between 10\% and 30\%, suggesting that both iterative code editing capability and domain knowledge play important roles. The specification injection experiment further confirms that 3GPP excerpts improve resolve rates on specification-dependent bugs, while the gains on generic defensive checks remain limited, indicating that the effect of domain knowledge is conditional on bug type.