Large language models (LLMs), such as Codex, hold great promise for enhancing programming education by automatically generating feedback for students. We investigate using LLMs to generate feedback for fixing syntax errors in Python programs, a key scenario in introductory programming. More concretely, given a student's buggy program, our goal is to generate feedback comprising a fixed program along with a natural language explanation describing the errors and fixes, inspired by how a human tutor would give feedback. While using LLMs is promising, the critical challenge is to ensure high precision in the generated feedback, which is imperative before deploying such technology in classrooms. The main research question we study is: Can we develop LLM-based feedback generation techniques with a tunable precision parameter, giving educators quality control over the feedback that students receive? To this end, we introduce PyFiXV, a Codex-powered technique for generating high-precision feedback. The key idea behind PyFiXV is to use a novel run-time validation mechanism to decide whether the generated feedback is suitable for sharing with the student; notably, this validation mechanism also provides a precision knob to educators. We perform an extensive evaluation using two real-world datasets of Python programs with syntax errors and show the efficacy of PyFiXV in generating high-precision feedback.
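The abstract does not say how PyFiXV's run-time validation scores a piece of feedback, so the following is only a hypothetical sketch of how a tunable threshold can act as a precision knob. It assumes each candidate feedback already carries a validation score in [0, 1] and that human correctness labels are available for measuring precision. The names `Feedback`, `score`, `should_share`, `precision_and_coverage`, and `threshold` are illustrative inventions, not PyFiXV's API. The only real library call is Python's built-in `compile`, used here as a basic check that the suggested fix actually parses.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    fixed_program: str
    explanation: str
    score: float  # hypothetical validation score in [0, 1]; not PyFiXV's actual metric


def is_syntactically_valid(program: str) -> bool:
    """Return True if the program compiles (IndentationError is a SyntaxError subclass)."""
    try:
        compile(program, "<student>", "exec")
    except (SyntaxError, ValueError):  # ValueError: null bytes on older Pythons
        return False
    return True


def should_share(fb: Feedback, threshold: float) -> bool:
    """Share only feedback whose fix parses and whose score clears the educator's threshold."""
    return is_syntactically_valid(fb.fixed_program) and fb.score >= threshold


def precision_and_coverage(items, threshold):
    """items: list of (Feedback, is_correct) pairs, where is_correct is a human label.

    Coverage is the fraction of feedback shared with students; precision is the
    fraction of shared feedback that is correct (vacuously 1.0 if nothing is shared).
    """
    shared = [ok for fb, ok in items if should_share(fb, threshold)]
    coverage = len(shared) / len(items) if items else 0.0
    precision = sum(shared) / len(shared) if shared else 1.0
    return precision, coverage


# Toy, made-up data purely to show the trade-off.
items = [
    (Feedback("print('hi')\n", "Added the missing closing parenthesis.", 0.9), True),
    (Feedback("x = 1\nif x == 1:\n    print(x)\n", "Added a colon after the if.", 0.6), True),
    (Feedback("def f(:\n    pass\n", "Fixed the def line.", 0.95), False),  # fix still broken
    (Feedback("y = [1, 2\n", "Closed the list.", 0.3), False),  # fix still broken
    (Feedback("print('bye')\n", "Fixed the string quotes.", 0.4), False),  # parses, but wrong
]

for t in (0.3, 0.5):
    print(t, precision_and_coverage(items, t))
```

In this toy data, raising the threshold from 0.3 to 0.5 filters out the syntactically valid but incorrect fix, so precision rises while coverage falls. This is the kind of quality-versus-quantity trade-off a precision knob lets an educator control.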