Large language models trained on code (LLMCs), such as Codex, hold great promise for enhancing programming education by automatically generating feedback for students. We investigate using LLMCs to generate feedback for fixing syntax errors in Python programs, a key scenario in introductory programming. More concretely, given a student's buggy program, our goal is to generate feedback comprising a fixed program along with a natural language explanation describing the errors and fixes, inspired by how a human tutor would give feedback. While using LLMCs is promising, the critical challenge is to ensure high precision in the generated feedback, which is imperative before deploying such technology in classrooms. The main research question we study is: Can we develop LLMC-based feedback generation techniques with a tunable precision parameter, giving educators quality control over the feedback that students receive? To this end, we introduce PyFiXV, our technique to generate high-precision feedback powered by Codex. The key idea behind PyFiXV is to use a novel run-time validation mechanism to decide whether the generated feedback is suitable for sharing with the student; notably, this validation mechanism also provides a precision knob to educators. We perform an extensive evaluation using two real-world datasets of Python programs with syntax errors and show the efficacy of PyFiXV in generating high-precision feedback.
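To make the idea of validation-gated feedback with a precision knob concrete, the following is a minimal sketch and not PyFiXV's actual mechanism, which the abstract does not detail. It assumes a hypothetical `generate` callable (standing in for an LLMC such as Codex) and a hypothetical `score` callable returning a value in [0, 1]; the names `Feedback`, `gated_feedback`, and `threshold` are illustrative. Only the syntax check uses a real facility, Python's standard `ast.parse`.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    fixed_program: str   # candidate repaired program
    explanation: str     # natural language description of the errors and fixes


def has_syntax_error(program: str) -> bool:
    """Return True if Python cannot parse the program."""
    try:
        ast.parse(program)
        return False
    except SyntaxError:
        return True


def gated_feedback(
    buggy: str,
    generate: Callable[[str], Feedback],      # hypothetical LLMC wrapper
    score: Callable[[str, Feedback], float],  # hypothetical validation score in [0, 1]
    threshold: float,                         # the educator-facing precision knob
) -> Optional[Feedback]:
    """Return feedback only if it passes validation; otherwise withhold it."""
    feedback = generate(buggy)
    # The proposed fix must itself be syntactically valid.
    if has_syntax_error(feedback.fixed_program):
        return None
    # A higher threshold shares less feedback but with higher expected precision.
    if score(buggy, feedback) < threshold:
        return None
    return feedback


# Stubs standing in for the model and the validator (assumptions for illustration).
def stub_generate(buggy: str) -> Feedback:
    return Feedback('print("hi")', "A closing parenthesis was missing.")


def stub_score(buggy: str, feedback: Feedback) -> float:
    return 0.8


buggy_program = 'print("hi"'
shared = gated_feedback(buggy_program, stub_generate, stub_score, threshold=0.5)
withheld = gated_feedback(buggy_program, stub_generate, stub_score, threshold=0.9)
```

With the stub score fixed at 0.8, the feedback is shared at a threshold of 0.5 and withheld at 0.9, which illustrates the trade-off an educator would tune: stricter thresholds reduce how often students receive feedback in exchange for higher confidence in what they do receive.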