Large language models (LLMs) such as GPT-4 have shown proficiency in generating code snippets from problem statements. Traditionally, human software development followed a similar methodology: writing code from problem statements or requirements. However, several past studies have shown the value of test-driven development (TDD), in which developers write tests based on the problem statement before writing the code for the functionality. In the context of LLM-based code generation, one obvious benefit of TDD is that the developer knows whether the generated code passes all the given tests. In this paper, we therefore empirically evaluate the hypothesis that giving GPT-4 both the problem statement and tests as input is better than giving it the problem statement alone. To test this hypothesis, we build a framework, TGen. In our experiments on the MBPP, HumanEval, and CodeChef datasets, we consistently find that including tests solves more programming problems than omitting them. We thus show that, when using GPT-4 for code generation tasks, TDD is a better development model than using only a problem statement.