Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even as autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating evaluation coverage. To address these issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach identifies feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring that other features continue to function properly after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3,825 executable environments from 24 open-source repositories for the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agentic model, Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of our tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of the constructed environments also makes our method potentially valuable for agent training.
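The core extraction idea, tracing from a unit test along the dependency graph to collect the code units that implement a feature, can be sketched as a simple transitive-closure walk. This is an illustrative toy only: the graph, node names, and `trace_feature` helper below are hypothetical and not taken from the FeatureBench toolkit.

```python
from collections import deque

# Hypothetical dependency graph: each code unit maps to the units it
# directly depends on. Node names are illustrative, not from FeatureBench.
DEPENDS_ON = {
    "tests/test_export.py::test_csv_export": ["pkg/export.py::to_csv"],
    "pkg/export.py::to_csv": ["pkg/serialize.py::encode_row", "pkg/io.py::open_sink"],
    "pkg/serialize.py::encode_row": ["pkg/types.py::Row"],
    "pkg/io.py::open_sink": [],
    "pkg/types.py::Row": [],
}

def trace_feature(test_node: str) -> set[str]:
    """BFS from a unit test to collect the transitive set of code units
    it depends on; this set approximates the feature the test verifies."""
    seen: set[str] = set()
    queue = deque([test_node])
    while queue:
        node = queue.popleft()
        for dep in DEPENDS_ON.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

feature_units = trace_feature("tests/test_export.py::test_csv_export")
# → all four pkg/* units reachable from the test
```

In a real pipeline the traced set would then be carved out of the repository, and the remaining test suite re-executed to confirm that unrelated features still pass after the separation.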