Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug, which (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, helping LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Lastly, we demonstrate that UTGen is a better judge of code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
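The validate-and-backtrack idea behind UTDebug can be sketched as follows. This is a minimal illustration assuming a simple acceptance rule (keep an edit only if it passes strictly more generated unit tests than the current code); the function names and the exact criterion are illustrative assumptions, not the paper's implementation.

```python
def run_tests(code_fn, unit_tests):
    """Count how many (args, expected_output) pairs the candidate passes."""
    passed = 0
    for args, expected in unit_tests:
        try:
            if code_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed

def debug_loop(candidate, propose_edit, unit_tests, max_rounds=3):
    """Iteratively edit `candidate`, backtracking any edit that does not
    improve the number of generated unit tests passed (an assumed rule,
    in the spirit of UTDebug's validate-and-backtrack step)."""
    best_score = run_tests(candidate, unit_tests)
    for _ in range(max_rounds):
        edited = propose_edit(candidate)   # e.g., an LLM-proposed fix
        score = run_tests(edited, unit_tests)
        if score > best_score:             # validate: keep only improving edits
            candidate, best_score = edited, score
        # otherwise backtrack: discard the edit and try again
        if best_score == len(unit_tests):
            break                          # all generated tests pass
    return candidate
```

Because the generated UTs may themselves be noisy, checking against several of them (rather than a single test) is what guards against overfitting an edit to one spurious signal.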