TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Elena Bruches, Vadim Alperovich, Dari Baturova, Roman Derunets, Daniil Grebenkin, Georgy Mkrtchyan, Oleg Sedukhin, Mikhail Klementev, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev

From arXiv. Accepted for publication at the 9th Workshop on Validation, Analysis and Evolution of Software Tests (VST 2026), co-located with the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2026).

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
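
As a rough illustration of what a reference-free protocol built on pass rate, coverage, and mutation testing might compute, the sketch below derives the three ratios from raw counts and compares a test file before and after a model edits it. This is not the TAM-Eval implementation: the `SuiteRun` record, its field names, and the `summarize` and `delta` helpers are all hypothetical, and the benchmark's actual metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class SuiteRun:
    """Hypothetical raw counts from one execution of a test file."""
    tests_passed: int
    tests_total: int
    lines_covered: int
    lines_total: int
    mutants_killed: int
    mutants_total: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(run: SuiteRun) -> dict[str, float]:
    """Compute the three reference-free metrics as simple ratios (an assumption)."""
    return {
        "pass_rate": ratio(run.tests_passed, run.tests_total),
        "coverage": ratio(run.lines_covered, run.lines_total),
        "mutation_score": ratio(run.mutants_killed, run.mutants_total),
    }


def delta(before: SuiteRun, after: SuiteRun) -> dict[str, float]:
    """Per-metric change between the original and the model-edited test file."""
    b, a = summarize(before), summarize(after)
    return {name: a[name] - b[name] for name in b}


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    before = SuiteRun(3, 4, 50, 100, 3, 12)
    after = SuiteRun(4, 4, 75, 100, 6, 12)
    print(delta(before, after))
```

Because no reference test file is needed, the same comparison applies whether the edited tests come from a raw LLM or from an agentic workflow; only the counts collected from running the suite change.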
