LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working, runnable environment. For research artifacts released alongside published papers, setting up such an environment on a fresh machine remains a major bottleneck. Existing environment-setup benchmarks do not cover the full scope of research artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks that spans AI/ML, computer systems, and scientific computing and covers all of these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates ranging from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 failures (63%) are agent-terminated self-stops, where the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.