Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models in different scenarios relevant to programming education; however, these works are limited because they typically consider already outdated models or only specific scenarios. Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models on a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with that of human tutors across a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance in several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions for developing techniques to improve the performance of these models.