Evaluating LLM-Based Test Generation Under Software Evolution

Large language models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SACs) and semantic-preserving changes (SPCs) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to the updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in the generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical change rather than to actual semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.
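To make the distinction between the two change types concrete, the following is a minimal illustrative sketch, not the study's framework. The pricing function, its two variants, and the helper names `stale_test` and `classify` are all hypothetical. It shows a test whose oracle encodes the original behavior: the test fails on a semantic-altering variant while still passing on the original program, and it passes unchanged on a semantic-preserving variant.

```python
from typing import Callable

PriceFn = Callable[[int], int]


def price_original(qty: int) -> int:
    # Baseline program: the bulk discount applies from 10 units upward.
    if qty >= 10:
        return qty * 9
    return qty * 10


def price_sac(qty: int) -> int:
    # Semantic-altering change: the boundary operator moves from >= to >.
    if qty > 10:
        return qty * 9
    return qty * 10


def price_spc(qty: int) -> int:
    # Semantic-preserving change: same behavior, different surface form.
    unit_price = 9 if not qty < 10 else 10
    return unit_price * qty


def stale_test(impl: PriceFn) -> bool:
    # A test whose expected value reflects the ORIGINAL boundary behavior.
    return impl(10) == 90


def classify(test: Callable[[PriceFn], bool],
             original: PriceFn,
             variant: PriceFn) -> str:
    # Label a test generated for `variant` by comparing it with `original`.
    if test(variant):
        return "passes"
    # Fails on the variant: check whether it still matches the old behavior.
    return "aligned-with-original" if test(original) else "fails-on-both"


if __name__ == "__main__":
    print(classify(stale_test, price_original, price_sac))
    print(classify(stale_test, price_original, price_spc))
```

In this toy setting the "aligned-with-original" label corresponds to the failure pattern described above for SACs. A real analysis would also need coverage data to confirm that the test executes the modified region, which this sketch omits.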
