Unit testing is a commonly used approach in software engineering for verifying the correctness and robustness of code. Unit tests exercise small components of a codebase, such as an individual function or method, in isolation. Although unit tests have historically been written by human programmers, recent advances in AI, particularly large language models (LLMs), have brought corresponding advances in automatic unit test generation. In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter, a GPT-4-based LLM, for Python functions from the QuixBugs dataset. We focus on prompting because users can readily apply our findings and observations. We find that the quality of the generated unit tests is not sensitive to changes in minor details of the prompts. However, we observe that Code Interpreter is often able to identify and correct mistakes in code that it writes, suggesting that providing it with runnable code to check the correctness of its outputs would be beneficial, even though it is already often able to generate correctly formatted unit tests. Our findings suggest that, when prompting models similar to Code Interpreter, it is important to include the basic information necessary to generate unit tests, but minor details matter less.