The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.