Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT

Context: AI-assisted code generation tools have become increasingly prevalent in software engineering, offering the ability to generate code from natural language prompts or partial code inputs. Notable examples include GitHub Copilot, Amazon CodeWhisperer, and OpenAI's ChatGPT. Objective: This study compares the performance of these prominent code generation tools on code quality metrics such as Code Validity, Code Correctness, Code Security, Code Reliability, and Code Maintainability, in order to identify their strengths and shortcomings. Method: We assess the code generation capabilities of GitHub Copilot, Amazon CodeWhisperer, and ChatGPT using the HumanEval benchmark dataset. The generated code is then evaluated against the proposed code quality metrics. Results: Our analysis reveals that the latest versions of ChatGPT, GitHub Copilot, and Amazon CodeWhisperer generate correct code 65.2%, 46.3%, and 31.1% of the time, respectively. Compared with their earlier versions, the newer versions of GitHub Copilot and Amazon CodeWhisperer showed improvements of 18% and 7%, respectively. The average technical debt, considering code smells, was 8.9 minutes for ChatGPT, 9.1 minutes for GitHub Copilot, and 5.6 minutes for Amazon CodeWhisperer. Conclusions: This study highlights the strengths and weaknesses of some of the most popular code generation tools, providing valuable insights for practitioners. By comparing these generators, our results may help practitioners select the most suitable tool for a specific task and so support their decision-making.
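To illustrate how a correctness rate of this kind could be computed, the following is a minimal sketch and not the study's actual evaluation harness. It assumes the HumanEval convention that each problem provides a test string defining `check(candidate)` together with an entry-point function name; the helper names `passes_tests` and `correctness_rate` and the toy `add` problem are hypothetical and introduced only for illustration.

```python
from typing import Any, Dict, List


def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the generated code defines entry_point and passes check().

    Assumption: test_src defines `check(candidate)` in the HumanEval style.
    Any exception (syntax error, missing function, failed assertion) counts
    as an incorrect solution.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(candidate_src, namespace)   # define the generated function
        exec(test_src, namespace)        # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


def correctness_rate(results: List[bool]) -> float:
    """Percentage of problems whose generated solution passed all tests."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Hypothetical toy problem, not taken from HumanEval.
TEST_SRC = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(0, 0) == 0\n"
)
CANDIDATES = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # valid Python, wrong result
    "def add(a, b) return a + b\n",         # invalid Python (syntax error)
]

results = [passes_tests(src, TEST_SRC, "add") for src in CANDIDATES]
rate = correctness_rate(results)
print(f"correct: {sum(results)}/{len(results)} ({rate:.1f}%)")
```

The sketch runs generated code with `exec` in the current process and does not guard against non-terminating code; a real evaluation of untrusted model output would need an isolated sandbox with time limits.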
