Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug, which (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, helping LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Lastly, we demonstrate that UTGen is a better judge of code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
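The validate-and-backtrack idea behind UTDebug can be sketched as follows. This is a minimal illustration assuming a simple acceptance rule (keep an edit only if it passes strictly more generated unit tests than the current code); the function names and the exact criterion are illustrative assumptions, not the paper's implementation.

```python
def run_tests(code_fn, unit_tests):
    """Count how many (args, expected_output) pairs the candidate passes."""
    passed = 0
    for args, expected in unit_tests:
        try:
            if code_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed

def debug_loop(candidate, propose_edit, unit_tests, max_rounds=3):
    """Iteratively edit `candidate`, backtracking any edit that does not
    improve the number of generated unit tests passed (an assumed rule,
    in the spirit of UTDebug's validate-and-backtrack step)."""
    best_score = run_tests(candidate, unit_tests)
    for _ in range(max_rounds):
        edited = propose_edit(candidate)   # e.g., an LLM-proposed fix
        score = run_tests(edited, unit_tests)
        if score > best_score:             # validate: keep only improving edits
            candidate, best_score = edited, score
        # otherwise backtrack: discard the edit and try again
        if best_score == len(unit_tests):
            break                          # all generated tests pass
    return candidate
```

Because the generated UTs may themselves be noisy, checking against several of them (rather than a single test) is what guards against overfitting an edit to one spurious signal.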