Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content must be valid, especially in subjects like math, which requires models to understand the problem, the solution, and where the student's error lies. Feedback must also be pedagogically valid, reflecting effective tutoring strategies such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address the problems of automatically generating and evaluating feedback, considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 can use it effectively to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of feedback generated with Llama 2, an open-source LLM, qualitatively analyze our generation and evaluation systems through case studies, and outline several areas for future work.
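To make the DPO step concrete, below is a minimal sketch of the standard DPO objective applied to one preference pair of feedback messages. This is an illustrative implementation of the general DPO loss, not the paper's training code; the function name `dpo_loss` and the log-probability inputs are assumptions for illustration. The chosen feedback (`logp_w`) and rejected feedback (`logp_l`) would be ranked using GPT-4's rubric annotations, and the reference log-probabilities come from a frozen copy of the initial policy (e.g., Llama 2 before DPO fine-tuning).

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair of generated feedback.

    logp_w, logp_l         : summed token log-probs of the chosen /
                             rejected feedback under the current policy.
    ref_logp_w, ref_logp_l : the same quantities under the frozen
                             reference model.
    beta                   : strength of the implicit KL penalty.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen feedback over the rejected one, relative to the reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin; minimized when the policy
    # assigns the chosen feedback a higher relative likelihood.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; as the policy shifts probability mass toward the rubric-preferred feedback, the loss decreases. Averaging this loss over a dataset of annotated feedback pairs and backpropagating through the policy's log-probabilities gives the RL-free preference optimization described above.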