With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks that reflect the day-to-day work of software engineers. Among these tasks, software migration, the critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine-bench, a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human-verified subset to ensure problem solvability. On this verified subset, we evaluated agent-based baselines built on top of 11 models, including both strong open-weight and state-of-the-art proprietary LLMs. Our results indicate that, while LLMs show some promise on migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool-use strategies. Our dataset and implementation are available at https://github.com/tohoku-nlp/timemachine-bench.