The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within computing education research (CER), LLMs have received considerable attention, especially in the context of learning to program. Much of the work on LLMs in CER has, however, focused on applying and evaluating proprietary models. In this article, we evaluate the performance of open-source LLMs both in generating high-quality feedback for programming assignments and in judging the quality of programming feedback, contrasting the results against proprietary models. Our evaluations on a dataset of student submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs (Meta's Llama3) are almost on par with proprietary models (GPT-4o) in both the generation and assessment of programming feedback. We further demonstrate the efficiency of smaller LLMs on these tasks, and highlight that a wide range of LLMs are accessible, even for free, to educators and practitioners.
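
The two tasks evaluated here, feedback generation and feedback judging, can both be framed as chat-completion calls to a locally hosted open model. The following is a minimal sketch, assuming Llama3 is served through Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1; the prompt wording and the 1-5 rubric are illustrative assumptions, not the actual instruments used in the study.

```python
# Sketch of the two tasks the abstract describes: generating programming
# feedback with an open-source LLM, then judging the feedback's quality.
# Assumes Llama3 runs locally behind Ollama's OpenAI-compatible endpoint;
# the prompts below are illustrative, not the paper's actual prompts.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


def generate_feedback(exercise: str, submission: str) -> str:
    """Ask the model for feedback on a student's Python submission."""
    response = client.chat.completions.create(
        model="llama3",
        messages=[
            {"role": "system",
             "content": "You are a programming tutor. Point out issues in "
                        "the student's code without revealing a solution."},
            {"role": "user",
             "content": f"Exercise:\n{exercise}\n\nSubmission:\n{submission}"},
        ],
    )
    return response.choices[0].message.content


def judge_feedback(exercise: str, submission: str, feedback: str) -> str:
    """Ask a model to rate previously generated feedback (LLM-as-a-judge)."""
    response = client.chat.completions.create(
        model="llama3",
        messages=[
            {"role": "system",
             "content": "You judge programming feedback. Reply with a rating "
                        "from 1 (unhelpful) to 5 (excellent) and a one-"
                        "sentence justification."},
            {"role": "user",
             "content": f"Exercise:\n{exercise}\n\nSubmission:\n{submission}"
                        f"\n\nFeedback to judge:\n{feedback}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    exercise = "Write a function mean(xs) that returns the average of a list."
    submission = "def mean(xs):\n    return sum(xs) / len(xs) + 1"
    feedback = generate_feedback(exercise, submission)
    print("Feedback:", feedback)
    print("Judgment:", judge_feedback(exercise, submission, feedback))
```

Because the endpoint is OpenAI-compatible, swapping in a smaller open model or a proprietary one for comparison only requires changing the `model` string and `base_url`.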