This paper studies recent developments in the ability of large language models (LLMs) to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates about its potential uses (e.g., exercise generation, code explanation) as well as misuses (e.g., cheating) in programming classes. Recent studies show that while the technology performs surprisingly well on the diverse assessment instruments employed in typical programming classes, its performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in handling assessments originally designed for human test-takers. This study provides a necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (MCQs) involving no code to complex programming projects with code bases distributed across multiple files (599 exercises overall). Additionally, we analyze the assessments that GPT-4 did not handle well in order to understand the current limitations of the model, as well as its ability to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the assessments of a typical programming class (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4's handling of MCQs and coding exercises, the rate of improvement across recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses.
These findings could be leveraged by educators and institutions to adapt the design of programming assessments, as well as to fuel the necessary discussions about how programming classes should be updated to reflect recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which an easy-to-use, widely accessible technology allows learners to collect passing scores, with no effort whatsoever, on what today count as viable assessments of programming knowledge and skills.