The transformative influence of Large Language Models (LLMs) is profoundly reshaping the Artificial Intelligence (AI) landscape. Among these models, ChatGPT stands out, demonstrating remarkable performance in multi-turn conversations and exhibiting code proficiency across an array of programming languages. In this paper, we carry out a comprehensive evaluation of ChatGPT's coding capabilities based on what is, to date, the largest catalog of coding challenges. Our focus is on the Python programming language and on problems centered on data structures and algorithms, two topics at the very foundations of Computer Science. We evaluate ChatGPT's ability to generate correct solutions to the problems fed to it, the quality of its code, and the nature of the run-time errors thrown by its code. Where ChatGPT's code executes successfully but fails to solve the problem at hand, we look into patterns in the test cases passed in order to gain insight into how wrong its code is in these situations. To infer whether ChatGPT may have directly memorized some of the data used to train it, we methodically design an experiment to investigate this phenomenon. Making comparisons with human performance wherever feasible, we investigate all of the above questions in the context of both of its underlying models (GPT-3.5 and GPT-4), across a vast array of sub-topics within the main topics, and on problems of varying degrees of difficulty.