Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited because they typically consider already outdated models or only specific scenarios. Consequently, no systematic study has benchmarked state-of-the-art models on a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with that of human tutors across a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions for developing techniques to improve the performance of these models.