The rapid evolution of Large Language Models (LLMs) has strongly impacted software engineering, leading to a growing number of studies on automated unit test generation. However, the standalone use of LLMs without post-processing has proven insufficient, often producing tests that fail to compile or achieve high coverage. Several techniques have been proposed to address these issues, reporting improvements in test compilation and coverage. While important, these LLM-based test generation techniques have been evaluated against baselines that are relatively weak by today's standards, i.e., older LLM versions and relatively simple prompts, which may inflate the apparent contribution of the approaches. In other words, stronger (newer) LLMs may obviate any advantage these techniques bring. We investigate this issue by replicating four state-of-the-art LLM-based test generation tools (HITS, SymPrompt, TestSpark, and CoverUp), all of which include engineering components aimed at guiding the test generation process through compilation and execution feedback, and by evaluating their effectiveness and efficiency relative to a plain LLM test generation method. We integrate current LLM versions into all approaches and run an experiment on 393 classes and 3,657 methods. Our results show that the plain LLM approach outperforms the previous state-of-the-art approaches on all test effectiveness metrics we used: line coverage (by 17.72%), branch coverage (by 19.80%), and mutation score (by 20.92%), and it does so at a comparable cost (LLM queries). We also observe that the granularity at which the plain LLM is applied has a significant impact on cost. We therefore propose targeting the program classes first, where test generation is more efficient, and then the uncovered methods, to reduce the number of LLM requests. This strategy achieves comparable (slightly higher) effectiveness while requiring about 20% fewer LLM requests.