With the increasing use of large language models such as ChatGPT in software development, it has become crucial to verify the quality of the code content they generate. Recent studies have proposed using ChatGPT as both a developer and a tester for multi-agent collaborative software development. This multi-agent collaboration enables ChatGPT to produce test reports for its generated code, allowing it to self-verify the code content and fix bugs based on these reports. However, these studies did not assess the effectiveness of the generated test reports in validating the code. Therefore, we conduct a comprehensive empirical study to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair. We ask ChatGPT to (1) generate correct code and then self-verify its correctness; (2) complete code without vulnerabilities and then self-verify for the presence of vulnerabilities; and (3) repair buggy code and then self-verify whether the bugs are resolved. Our findings on two code generation datasets, one code completion dataset, and two program repair datasets reveal the following observations: (1) ChatGPT often erroneously predicts its incorrectly generated code to be correct. (2) ChatGPT exhibits self-contradictory hallucinations during self-verification. (3) ChatGPT's self-verification capability can be enhanced by asking a guiding question, which asks whether ChatGPT agrees with assertions about incorrectly generated or repaired code and about vulnerabilities in completed code. (4) Test reports generated by ChatGPT can identify more vulnerabilities in completed code, but the explanations they give for incorrectly generated code and failed repairs are mostly inaccurate. Based on these findings, we provide implications for further research and development using ChatGPT.