This paper evaluates the quality of code generated by GitHub Copilot on the LeetCode problem set using a custom automated framework. We evaluate Copilot's output in four programming languages: Java, C++, Python3, and Rust. We assess Copilot's reliability in the code generation stage, the correctness of the generated code, and how correctness depends on the programming language, the problem's difficulty level, and the problem's topic. We also evaluate the time and memory efficiency of the generated code and compare it to average human results. In total, we generate solutions for 1760 problems in each programming language and evaluate all of Copilot's suggestions for each problem, resulting in over 50000 submissions to LeetCode over a 2-month period. We found that Copilot successfully solved most of the problems, but it was notably more successful at generating code in Java and C++ than in Python3 and Rust. Moreover, for Python3, Copilot proved rather unreliable in the code generation phase. We also discovered that Copilot's top-ranked suggestions are not always the best. In addition, we analysed how the problem's topic affects the correctness rate. Finally, based on statistics from LeetCode, we conclude that Copilot generates more efficient code than the average human.