FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases as software ecosystems evolve. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness has not been systematically evaluated. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI agents on project-level Java migrations. It focuses on measuring an agent's ability to preserve program semantics and avoid reward hacking, which we argue requires projects with high test coverage for rigorous and reliable evaluation. We benchmark several state-of-the-art LLMs and compare their performance against established rule-based tools. On this benchmark of 228 repositories, the top-performing model, Gemini 2.5 Flash, successfully migrates 52.3% of projects to JDK 17. Our empirical analysis reveals the critical strengths, limitations, and failure modes of current agentic approaches in realistic Java modernization tasks, offering actionable insights into their real-world applicability and providing a foundation for evaluating trustworthy code-migration systems. By releasing FreshBrew, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
