Scaling up executable code data is essential for improving language models' software engineering capabilities. However, building large numbers of executable code repositories is labor-intensive, time-consuming, and dependent on expert knowledge, which limits the scalability of existing work based on running tests. The primary bottleneck lies in the automated construction of test environments for diverse repositories, an essential yet underexplored task. To bridge this gap, we introduce Repo2Run, the first LLM-based agent for automatically building executable test environments for arbitrary repositories at scale. Given a code repository, Repo2Run iteratively builds a Docker image, runs the unit tests, and synthesizes the Dockerfile based on build feedback until the entire pipeline executes successfully. The resulting Dockerfile can then be used to create Docker container environments for running the code and its tests. For evaluation, we constructed a benchmark of 420 Python repositories with unit tests. The results show that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.
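The iterative build-test loop described above can be sketched as a simple feedback-driven control flow. This is a minimal illustration, not the paper's implementation: the function names `try_build`, `run_unit_tests`, and `revise_dockerfile` are hypothetical placeholders for the agent's actual Docker-build, test-execution, and LLM-driven revision steps.

```python
def synthesize_dockerfile(initial_dockerfile, try_build, run_unit_tests,
                          revise_dockerfile, max_attempts=10):
    """Iteratively refine a Dockerfile until the image builds and tests pass.

    Expected callable signatures (hypothetical, for illustration):
      try_build(dockerfile)          -> (ok: bool, log: str)
      run_unit_tests(dockerfile)     -> (ok: bool, log: str)
      revise_dockerfile(df, log)     -> new Dockerfile text (an LLM call in practice)
    Returns the successful Dockerfile, or None if no attempt succeeds.
    """
    dockerfile = initial_dockerfile
    for _ in range(max_attempts):
        # Step 1: try to build the Docker image from the current Dockerfile.
        built, build_log = try_build(dockerfile)
        if not built:
            # Feed the build failure back to the revision step and retry.
            dockerfile = revise_dockerfile(dockerfile, build_log)
            continue
        # Step 2: run the repository's unit tests inside the built image.
        passed, test_log = run_unit_tests(dockerfile)
        if passed:
            return dockerfile  # the whole pipeline executed successfully
        # Test failures also drive a revision of the Dockerfile.
        dockerfile = revise_dockerfile(dockerfile, test_log)
    return None  # gave up after max_attempts iterations
```

The key design point is that both build errors and test failures are routed back into the same revision step, so a single loop converges on a Dockerfile that satisfies the entire pipeline.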