Reliable Docker-based environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not merely as a preprocessing step but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline augmented with a loop-detection controller and a cross-task success memory. A 30B-A3B model trained on these trajectories achieves state-of-the-art performance among open-source models on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, indicating that training on environment construction yields broader agentic benefits.