Large language models (LLMs) have recently made significant strides in generating computer code, blurring the line between code written by humans and code produced by artificial intelligence (AI). As these technologies evolve rapidly, it is crucial to understand how they influence code generation, especially given the risk of misuse in areas such as higher education. This paper addresses the issue by using advanced classification techniques to distinguish code written by humans from code generated by ChatGPT, a type of LLM. We employ a new approach that combines powerful embedding features (black-box) with supervised learning algorithms (including Deep Neural Networks, Random Forests, and Extreme Gradient Boosting) to achieve this differentiation with an accuracy of 98%. For the successful combinations, we also examine model calibration and show that some of the models are extremely well calibrated. Additionally, we present white-box features and an interpretable Bayes classifier to elucidate critical differences between the code sources, enhancing the explainability and transparency of our approach. Both of these approaches work well but reach at most 85-88% accuracy. We also show that untrained humans perform no better than random guessing on the same task. This study is crucial for understanding and mitigating the potential risks associated with using AI in code generation, particularly in the context of higher education, software development, and competitive programming.
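The calibration check mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which calibration measure is used, so the binned expected calibration error (ECE) below is an assumption chosen for illustration. The function name `expected_calibration_error`, the parameter `n_bins`, and the convention that label 1 means "AI-generated" are all hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for a binary classifier (illustrative, not the paper's code).

    probs  -- predicted probabilities of the positive class, each in [0, 1]
              (assumed here to mean "AI-generated")
    labels -- true labels, 0 or 1, in the same order as probs
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width probability bins.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))

    # Weighted average gap between mean predicted probability and
    # observed positive frequency in each non-empty bin.
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - frac_pos)
    return ece


# A model that says 0.9 but is right only half the time is overconfident.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
```

A value near 0 indicates that predicted probabilities match observed frequencies, which is the sense in which a model would be called well calibrated. Larger values indicate over- or under-confidence.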