Large Language Models (LLMs) have shown strong capability across diverse software engineering tasks. However, feature-driven development, a highly prevalent real-world task that involves developing new functionality for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world, end-to-end feature-driven software development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides every instance with a runnable environment and its developer-authored, executable unit tests. This collection not only supplies high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. We evaluate SWE-Dev across 17 base LLMs, 10 reasoning-focused LLMs, 10 multi-agent systems, and 8 tool-augmented LLM agents. Results show substantial headroom: the best single-turn model reaches only 22.51\% Pass@1 on the hard split, while OpenHands agents improve to 56.44\% but still leave many tasks unsolved. Code is available at https://github.com/DorothyDUUU/SWE-Dev.