Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses

This paper studies recent developments in the abilities of large language models (LLMs) to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates about its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming classes (e.g., cheating). Recent studies show that while the technology performs surprisingly well on diverse sets of assessment instruments employed in typical programming classes, the performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study provides the necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to the previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Additionally, we analyze the assessments that were not handled well by GPT-4 to understand the current limitations of the model, as well as its capabilities to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing a typical programming class's assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4's handling of MCQs and coding exercises, the rate of improvement across the recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses.
These findings could be leveraged by educators and institutions to adapt the design of programming assessments as well as to fuel the necessary discussions about how programming classes should be updated to reflect the recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use, widely accessible technology that can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable programming knowledge and skills assessments.

Related content

GPT-4

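In the early hours of March 15, 2023 (Beijing time), OpenAI, the developer of ChatGPT, released GPT-4, a new multimodal pre-trained large model that is more reliable and more creative, can handle more nuanced instructions, and can generate content from both image and text prompts. Specifically, compared with the previous generation of models, GPT-4 is a leap forward: it accepts image and text input and has strong image-recognition capabilities; it greatly raises the text input limit, so that in ChatGPT mode GPT-4 can process more than 25,000 words of text and handle more detailed instructions; and the accuracy of its answers has improved significantly.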
