A case study on the transformative potential of AI in software engineering on LeetCode and ChatGPT

The recent surge in generative artificial intelligence (GenAI) has the potential to bring about transformative changes across a range of sectors, including software engineering and education. As GenAI tools such as OpenAI's ChatGPT are increasingly utilised in software engineering, it becomes imperative to understand the impact of these technologies on the software product. This study uses web scraping and data mining of LeetCode to compare the software quality of Python programs written by LeetCode users with that of programs generated by GPT-4o. It addresses the question of whether GPT-4o produces software of superior quality to that produced by humans. The findings indicate that GPT-4o does not present a considerable impediment to code quality, understandability, or runtime when generating code on a limited scale. Indeed, the generated code exhibits significantly lower (i.e. better) values across all three metrics than the user-written code. However, contrary to expectations, the generated code showed no significantly superior values for memory usage in comparison to the user code. Furthermore, the study shows that GPT-4o encountered challenges in generalising to problems that were not included in the training data set. This contribution presents a first large-scale study comparing generated code with human-written code on the LeetCode platform, based on multiple measures including code quality, code understandability, time behaviour and resource utilisation. All data is publicly available for further research.
