Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and the scarcity of large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action execution to complete the final environment configuration. Our task instances are constructed automatically by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level assessment of internal agent capabilities for environment configuration, offering actionable insights for improving software engineering agents.