Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited because they typically consider already outdated models or only specific scenarios. Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models on a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with that of human tutors for a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions for developing techniques to improve the performance of these models.