Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks?

As large language models such as ChatGPT are increasingly used in software development, it has become crucial to verify the quality of the code they generate. Recent studies have proposed using ChatGPT as both developer and tester in multi-agent collaborative software development. This collaboration lets ChatGPT produce test reports for its own generated code, so that it can self-verify the code and fix bugs based on those reports. However, these studies did not assess how effective the generated test reports are at validating the code. We therefore conduct a comprehensive empirical investigation of ChatGPT's self-verification capability in code generation, code completion, and program repair. We ask ChatGPT to (1) generate correct code and then self-verify its correctness; (2) complete code without vulnerabilities and then self-verify whether vulnerabilities are present; and (3) repair buggy code and then self-verify whether the bugs are resolved. Our findings on two code generation datasets, one code completion dataset, and two program repair datasets reveal the following: (1) ChatGPT often erroneously predicts that its incorrect generated code is correct. (2) Self-contradictory hallucinations arise in ChatGPT's behavior. (3) ChatGPT's self-verification capability can be enhanced by asking a guiding question, which asks whether ChatGPT agrees with an assertion that the generated or repaired code is incorrect or that the completed code contains vulnerabilities. (4) Test reports generated by ChatGPT can identify more vulnerabilities in completed code, but their explanations for incorrectly generated code and failed repairs are mostly inaccurate. Based on these findings, we provide implications for further research or development using ChatGPT.
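The difference between a direct self-verification question and a guiding question can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the `ask` callable standing in for a chat model, and the helper names are hypothetical and are not the prompts or tooling used in the study. Ground-truth labels are assumed to come from running a dataset's own tests.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical prompt templates (illustrative wording only).
DIRECT_PROMPT = "Is the following code correct? Answer yes or no.\n\n{code}"
GUIDING_PROMPT = (
    "The following code is incorrect. Do you agree? Answer yes or no.\n\n{code}"
)


def says_yes(reply: str) -> bool:
    """Treat any reply that starts with 'yes' (case-insensitive) as agreement."""
    return reply.strip().lower().startswith("yes")


def judged_correct(ask: Callable[[str], str], code: str, guiding: bool = False) -> bool:
    """Return True if the model's answer amounts to 'this code is correct'."""
    if guiding:
        # Agreeing with "the code is incorrect" means the model judges it incorrect.
        return not says_yes(ask(GUIDING_PROMPT.format(code=code)))
    return says_yes(ask(DIRECT_PROMPT.format(code=code)))


def false_acceptance_rate(
    ask: Callable[[str], str],
    samples: Iterable[Tuple[str, bool]],
    guiding: bool = False,
) -> float:
    """Fraction of actually incorrect snippets that the model judges correct.

    `samples` holds (code, passes_tests) pairs; passes_tests is the ground truth.
    """
    incorrect = [code for code, passes_tests in samples if not passes_tests]
    if not incorrect:
        return 0.0
    accepted = sum(judged_correct(ask, code, guiding) for code in incorrect)
    return accepted / len(incorrect)
```

In use, `ask` would wrap a real chat-model call. A stub such as `lambda prompt: "Yes"` only exercises the mechanics and says nothing about how an actual model behaves under either prompt style.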
