This paper investigates the performance of the large language models (LLMs) ChatGPT-3.5 and GPT-4 in solving introductory programming tasks. Based on this performance, implications for didactic scenarios and assessment formats that use LLMs are derived. For the analysis, 72 Python tasks for novice programmers were selected from the free site CodingBat. Full task descriptions were used as input to the LLMs, and the generated replies were evaluated using CodingBat's unit tests. In addition, the general availability of textual explanations and program code in the replies was analyzed. The results show high scores of 94.4% to 95.8% correct responses and reliable availability of textual explanations and program code, which opens new ways to incorporate LLMs into programming education and assessment.
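The evaluation procedure described above (task description in, generated code out, unit tests as the judge) can be illustrated with a minimal sketch. It is not the authors' tooling: the helper names `passes_all_tests` and `percent_correct`, the example reply string, and the test cases are all hypothetical stand-ins, and the actual study relied on CodingBat's own unit tests. The final lines also show one reading of the reported scores: with 72 tasks, 94.4% and 95.8% are consistent with 68 and 69 correct solutions, which is an inference from the arithmetic and is not stated in the abstract.

```python
from typing import Any


def passes_all_tests(source: str, func_name: str,
                     cases: list[tuple[tuple, Any]]) -> bool:
    """Run LLM-generated source and check it against (args, expected) pairs.

    Hypothetical helper: exec() is used only for illustration and should
    never be applied to untrusted code outside a sandbox.
    """
    namespace: dict[str, Any] = {}
    try:
        exec(source, namespace)          # define the generated function
        func = namespace[func_name]      # look it up by its expected name
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors count as failures.
        return False


def percent_correct(n_correct: int, n_total: int) -> float:
    """Share of correctly solved tasks, rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


# Hypothetical model reply for a CodingBat-style warm-up task.
reply = "def sleep_in(weekday, vacation):\n    return not weekday or vacation\n"

# Hypothetical test cases standing in for the site's unit tests.
cases = [
    ((False, False), True),
    ((True, False), False),
    ((False, True), True),
    ((True, True), True),
]

solved = passes_all_tests(reply, "sleep_in", cases)

# Assumed task counts that reproduce the reported score range for 72 tasks.
low_score = percent_correct(68, 72)
high_score = percent_correct(69, 72)
```

Treating any exception as a failed task mirrors a strict pass/fail grading scheme, where a reply counts as correct only if every test passes.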