With the rapid digitization of all major industries, the demand for programming skills, and consequently for introductory programming courses, has increased. Universities have responded by integrating programming courses into a wide range of curricula, spanning not only technical disciplines but also business and management programs. Consequently, additional resources are needed for teaching, grading, and tutoring students with diverse educational backgrounds and skill levels. In response, Automated Programming Assessment Systems (APASs) have emerged, providing scalable, high-quality assessment with efficient evaluation and instant feedback. APASs commonly rely on predefined unit tests to generate feedback, which often limits the scope and level of detail of the feedback students receive. The recent rise of Large Language Models (LLMs) opens new opportunities, as these technologies can enhance both feedback quality and personalization. To investigate how students perceive different feedback mechanisms in APASs, and how effective these mechanisms are in supporting problem-solving, we conducted a large-scale study with over 200 students from two universities. Specifically, we compare baseline Compiler Feedback, standard Unit Test Feedback, and advanced LLM-based Feedback with respect to perceived quality and impact on student performance. Results indicate that while students rate unit test feedback as the most helpful, AI-generated feedback leads to significantly better performance. These findings suggest that combining unit tests with AI-driven guidance can optimize automated feedback mechanisms and improve learning outcomes in programming education.