Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further extension of the benchmark and its use for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, each tested with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach manages to successfully configure 6.69% of the Python repositories and 29.47% of the JVM repositories, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench. The dataset and experiment trajectories are available at https://jb.gg/envbench.
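The missing-import check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the function name, the use of `ast` for parsing, and the resolution via `importlib.util.find_spec` are all assumptions, and local modules defined inside the repository itself would need to be filtered out in a realistic version.

```python
import ast
import importlib.util
from pathlib import Path


def missing_imports(repo_root: str) -> set[str]:
    """Collect top-level module names imported anywhere in the repository
    that cannot be resolved in the current Python environment."""
    missing: set[str] = set()
    for py_file in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names = [node.module]  # absolute `from X import ...` only
            else:
                continue
            for name in names:
                top = name.split(".")[0]
                # find_spec returns None when the top-level module
                # is not installed in the active environment
                if importlib.util.find_spec(top) is None:
                    missing.add(top)
    return missing
```

An empty result would indicate a successfully configured environment under this proxy metric; any reported names point at dependencies the setup script failed to install.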