As the telecommunications field embraces zero-touch management alongside novel O-RAN and AI-RAN frameworks, contemporary telecom networks are now built on immensely intricate, heavily softwarized codebases. While automated software engineering (ASE) tools and software engineering (SWE) agents could alleviate the critical code generation bottleneck in this domain, their ability to navigate and modify specialized, mathematically rigorous wireless stacks such as srsRAN 5G remains unverified. General-purpose coding benchmarks fail to capture the stateful logic and strict requirements of telecommunications, leaving a critical evaluation gap. In this paper, we introduce TeleSWEBench, the first commit-driven benchmark specifically designed to measure agent performance in the telecom domain. We mine real developer commits from the srsRAN 5G repository and distill them into structured test cases across three difficulty tiers (Easy, Medium, and Difficult). The benchmark consists of 734 questions accompanied by executable unit tests. Because unit tests alone can be overly rigid, we further propose TeleJudge, a hierarchical LLM-as-a-Judge framework that scores agent outputs at the file level and aggregates the verdicts holistically. TeleJudge evaluates outputs by context and semantic similarity, in parallel with the standard unit-test-based evaluation. Using this benchmark, we evaluate the AIDER, OpenHands, and ClaudeCode frameworks, powered by state-of-the-art reasoning LLMs, including Qwen3, GPT OSS, Gemma 4, Kimi, and Qwencoder 2.5. Our two-stage evaluation reveals that models lack both localization accuracy and functional correctness, with even the strongest ASE tools producing shippable changes in at most 25% of cases.
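The two-stage evaluation described in the abstract (file-level judging, holistic aggregation, and a parallel unit-test check) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `FileVerdict` fields, the plain-mean aggregation rule, the `0.8` threshold, and the file paths are all hypothetical, and the judge scores are taken as given rather than produced by an LLM.

```python
from dataclasses import dataclass


@dataclass
class FileVerdict:
    path: str        # file changed by the reference commit (hypothetical path)
    score: float     # hypothetical file-level judge score in [0, 1]
    localized: bool  # True if the agent edited this file at all


def aggregate(verdicts: list[FileVerdict]) -> dict[str, float]:
    """Combine file-level verdicts into a task-level summary.

    Assumed rule: plain mean over reference files, where a file the
    agent never touched contributes a judge score of 0.
    """
    if not verdicts:
        return {"localization": 0.0, "judge_score": 0.0}
    n = len(verdicts)
    localization = sum(1 for v in verdicts if v.localized) / n
    judge_score = sum(v.score if v.localized else 0.0 for v in verdicts) / n
    return {"localization": localization, "judge_score": judge_score}


def is_shippable(summary: dict[str, float], tests_passed: bool,
                 threshold: float = 0.8) -> bool:
    """Assumed two-stage rule: unit tests pass AND judge score meets the threshold."""
    return tests_passed and summary["judge_score"] >= threshold


# Hypothetical verdicts for one task whose reference commit touches three files.
verdicts = [
    FileVerdict("lib/example_a.cpp", 1.0, True),
    FileVerdict("include/example_b.h", 0.5, True),
    FileVerdict("lib/example_c.cpp", 0.9, False),  # agent missed this file
]
summary = aggregate(verdicts)
shippable = is_shippable(summary, tests_passed=True)
```

Reporting localization separately from the judge score mirrors the abstract's distinction between localization accuracy and functional correctness; other aggregation rules (for example, weighting files by the size of the change) would fit the same structure.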