Despite demonstrating extraordinary capabilities in code generation and software issue resolution, AI agents' capabilities across the full software DevOps cycle remain largely unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, spanning development, deployment, and management, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack the environments and tool interfaces that DevOps requires. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym comprises 700+ real-world tasks collected from 30+ Java and Go projects. We develop a semi-automated data collection mechanism in which rigorous, non-trivial expert effort ensures task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and they cannot yet handle novel tasks such as monitoring and build and configuration. These results highlight the need for further research into automating the full DevOps cycle with AI agents.