Intelligent Tutoring Systems (ITSs) often contain an automated feedback component that provides a predefined feedback message to students when it detects a predefined error. Such feedback components are typically built with template-based approaches, which require significant effort from human experts to anticipate a limited set of possible student errors and author corresponding feedback. This limitation is especially pronounced for open-ended math questions, where students can make a large number of different errors. In our work, we examine the ability of large language models (LLMs) to generate feedback for open-ended math questions comparable to that of an established ITS that uses a template-based approach. We fine-tune both open-source and proprietary LLMs on real student responses and the corresponding ITS-provided feedback, and measure the quality of the generated feedback using text similarity metrics. We find that both open-source and proprietary models show promise in replicating the feedback they see during training, but do not generalize well to previously unseen student errors. These results suggest that, despite learning the formatting of feedback, LLMs are not able to fully understand the mathematical errors made by students.