In this paper, we systematically study the quality of 4,066 ChatGPT-generated programs written in two popular programming languages, Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Our experiments show that, of the 4,066 programs generated by ChatGPT, 2,757 are deemed correct, 1,081 produce wrong outputs, and 177 contain compilation or runtime errors. We also use static analysis tools to examine other characteristics of the generated code, such as code style and maintainability, and find that 1,933 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT's self-debugging ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. The experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but limitations and opportunities for improvement remain. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.
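To put the reported outcome counts in proportion, the following minimal sketch converts them into shares of the 4,066 generated programs. The counts are taken from the abstract as stated; the names `TOTAL_PROGRAMS`, `REPORTED_COUNTS`, and `outcome_shares` are hypothetical and introduced only for illustration. The sketch also computes how many programs fall outside the three listed categories, since the abstract does not say how those are classified.

```python
# Outcome counts exactly as reported in the abstract (illustrative names).
TOTAL_PROGRAMS = 4066
REPORTED_COUNTS = {
    "correct": 2757,
    "wrong_output": 1081,
    "compile_or_runtime_error": 177,
}


def outcome_shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = outcome_shares(REPORTED_COUNTS, TOTAL_PROGRAMS)

# Programs not covered by the three categories listed in the abstract.
unaccounted = TOTAL_PROGRAMS - sum(REPORTED_COUNTS.values())

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"not covered by the three categories: {unaccounted}")
```

Under these figures, roughly two thirds of the generated programs are deemed correct and about a quarter produce wrong outputs. The three categories together cover 4,015 programs, leaving 51 that the abstract does not break down.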