Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors, along with their correct expected outputs, based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and back-tracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves the pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and on our own harder debugging split of MBPP+ by over 3% and 12.35%, respectively, over other LLM-based UT generation baselines.
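The validate-and-back-track idea described above can be sketched as a simple loop: an edit proposed by the LLM is kept only if it improves the aggregate pass rate on the generated unit tests, and is discarded (back-tracked) otherwise. This is a minimal illustrative sketch, not the paper's implementation; the helpers `llm_propose_fix` and `run_tests` are hypothetical placeholders.

```python
# Hedged sketch of a validate-and-back-track debugging loop in the spirit
# of UTDebug. `llm_propose_fix` and `run_tests` are assumed, hypothetical
# callables, not the paper's actual API.

def pass_rate(code, tests, run_tests):
    """Fraction of generated unit tests that the candidate code passes."""
    if not tests:
        return 0.0
    results = run_tests(code, tests)
    return sum(results) / len(results)

def debug_with_generated_tests(code, tests, llm_propose_fix, run_tests,
                               max_rounds=3):
    """Iteratively debug `code` using feedback from generated unit tests,
    keeping an edit only if it improves the overall pass rate."""
    best_code = code
    best_rate = pass_rate(code, tests, run_tests)
    for _ in range(max_rounds):
        if best_rate == 1.0:  # all generated UTs already pass
            break
        # Collect the currently failing tests as debugging feedback.
        failing = [t for t, ok in zip(tests, run_tests(best_code, tests))
                   if not ok]
        candidate = llm_propose_fix(best_code, failing)
        rate = pass_rate(candidate, tests, run_tests)
        # Validate against ALL generated UTs; back-track (keep the prior
        # code) if the edit does not improve the aggregate pass rate,
        # which guards against overfitting to noisy test outputs.
        if rate > best_rate:
            best_code, best_rate = candidate, rate
    return best_code
```

Validating on the full set of generated tests, rather than only the one that triggered the edit, is what prevents the model from "fixing" one test while silently breaking others.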