With the rapid evolution of large language models (LLMs), automated software testing is undergoing a paradigm shift. While proprietary models such as GPT-4o demonstrate impressive capabilities, their high deployment costs and data privacy concerns make open-source LLMs the practical choice for many academic and industrial scenarios. Automated test generation has evolved toward iterative workflows that use LLMs to construct test suites. When using open-source LLMs in such workflows, we empirically observe that they lack a suite-level perspective and suffer from structural myopia: they fail to generate new tests with large marginal gain given the current coverage state. In this paper, taking a sequential perspective, we formalize test suite generation as a Markov decision process (MDP) and show that its objective is monotone submodular, which allows this NP-hard global optimization to be relaxed into a tractable step-wise greedy procedure. Guided by this insight, we propose TestDecision, which transforms LLMs into neural greedy experts. TestDecision consists of two synergistic components: (1) an inference framework that constructs the test suite following a step-wise greedy strategy; and (2) a reinforcement learning training pipeline that equips the base LLM with the sequential test generation ability needed to maximize marginal gain. Comprehensive evaluations on the ULT benchmark demonstrate that TestDecision significantly outperforms existing advanced methods. Across all base models, it improves branch coverage by 38.15% to 52.37% and execution pass rate by 298.22% to 558.88%, and with a 7B backbone it achieves performance comparable to that of the much larger proprietary LLM GPT-5.2. Furthermore, TestDecision finds 58.43% to 95.45% more bugs than the vanilla base LLMs and exhibits superior generalization on LiveCodeBench, demonstrating its ability to construct high-quality test suites.
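To make the step-wise greedy idea concrete, the following is a minimal sketch, not TestDecision's actual implementation. It assumes each candidate test is already summarized by the set of branch IDs it covers, and it picks the candidate with the largest marginal coverage gain at every step. The names `greedy_suite`, `candidates`, and `budget` are hypothetical and chosen only for illustration. Set coverage is a standard example of a monotone submodular objective, for which greedy selection under a cardinality budget is known to achieve at least a (1 - 1/e) fraction of the optimum.

```python
from typing import Dict, FrozenSet, List, Set


def greedy_suite(candidates: Dict[str, FrozenSet[int]], budget: int) -> List[str]:
    """Select up to `budget` tests, each time taking the largest marginal gain.

    `candidates` maps a (hypothetical) test name to the branch IDs it covers.
    """
    covered: Set[int] = set()   # branches covered by the suite so far
    suite: List[str] = []       # selected test names, in selection order
    for _ in range(budget):
        best_name, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for name in sorted(candidates):
            if name in suite:
                continue
            # Marginal gain: branches this test adds beyond current coverage.
            gain = len(candidates[name] - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no remaining candidate adds new coverage
        suite.append(best_name)
        covered |= candidates[best_name]
    return suite


# Toy data (illustrative only): four candidate tests over five branches.
candidates = {
    "test_a": frozenset({1, 2, 3}),
    "test_b": frozenset({3, 4}),
    "test_c": frozenset({1, 2}),
    "test_d": frozenset({5}),
}
print(greedy_suite(candidates, budget=3))  # ['test_a', 'test_b', 'test_d']
```

In this toy run, `test_c` is never chosen because everything it covers is already covered by `test_a`, so its marginal gain is zero. This is the suite-level view that a test generator judging each test in isolation would miss. In the setting described above, the candidates would come from an LLM rather than a fixed pool, so this sketch only illustrates the selection criterion.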