Ever since Large Language Models (LLMs) and related applications became broadly available, several studies have investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT-4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if these are provided in a timely manner and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 Turbo was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements: for example, the output is more structured and consistent, invalid casing in a student program's output is accurately identified, and in some cases the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted, such as stating that a submission is correct while also asking for an error to be fixed. The present work deepens our understanding of LLMs' potential and limitations, of how to integrate them into e-assessment systems and pedagogical scenarios, and of how to instruct students who use applications based on GPT-4.
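The study's prompting setup pairs each task specification with a student submission in a single prompt. A minimal sketch of how such a chat-style prompt could be assembled is shown below; the function name, message wording, and example task are illustrative assumptions, not the authors' actual prompt:

```python
# Hypothetical sketch: assemble a feedback prompt from a task
# specification and a student submission, in the chat-message format
# used by GPT-4-class APIs. Names and wording are assumptions.

def build_feedback_prompt(task_specification: str,
                          student_submission: str) -> list[dict]:
    """Return a chat-style message list combining task and submission."""
    system_msg = (
        "You are a tutor in an introductory programming course. "
        "Give feedback on the student's submission for the task below "
        "without revealing a full solution."
    )
    user_msg = (
        f"Task specification:\n{task_specification}\n\n"
        f"Student submission:\n{student_submission}"
    )
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]

if __name__ == "__main__":
    messages = build_feedback_prompt(
        "Print 'Hello, World!' exactly, including casing.",
        'print("hello, world!")',
    )
    print(messages[1]["content"])
```

The resulting message list would then be sent to the model; keeping the specification and submission in one user message mirrors the single-prompt design described above.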