Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors

Preprint, from arXiv. Nico Daheim and Jakub Macina contributed equally. Code and dataset are available at: https://github.com/eth-lre/verify-then-generate

Large language models (LLMs) present an opportunity to scale high-quality personalized education to all. A promising approach to this end is to build dialog tutoring models that scaffold students' problem-solving. However, even though existing LLMs perform well at solving reasoning questions, they struggle to precisely detect students' errors and to tailor their feedback to these errors. Inspired by real-world teaching practice, in which teachers identify student errors and customize their responses accordingly, we focus on verifying student solutions and show how grounding in such verification improves the overall quality of tutor response generation. We collect a dataset of 1K stepwise math reasoning chains with the first error step annotated by teachers. We show empirically that finding the mistake in a student solution is challenging for current models. We propose and evaluate several verifiers for detecting these errors. Using both automatic and human evaluation, we show that the student solution verifiers steer the generation model towards highly targeted responses to student errors that are more often correct and contain fewer hallucinations than existing baselines.
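To make the verify-then-generate idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy rule-based verifier that only checks steps written as `a op b = c`, standing in for the verifiers evaluated in the paper, and it stops at building a prompt rather than calling a generation model. The names `first_error_step` and `build_tutor_prompt` and the prompt wording are hypothetical.

```python
import operator
import re
from typing import List, Optional

# Toy assumption: each step has the form "a op b = c" with plain numbers.
STEP_PATTERN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)\s*$"
)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def first_error_step(steps: List[str]) -> Optional[int]:
    """Return the 0-based index of the first arithmetically wrong step, or None."""
    for i, step in enumerate(steps):
        match = STEP_PATTERN.match(step)
        if match is None:
            continue  # this toy verifier skips steps it cannot parse
        a, op, b, claimed = match.groups()
        a, b, claimed = float(a), float(b), float(claimed)
        if op == "/" and b == 0:
            return i  # division by zero is treated as an error
        if abs(OPS[op](a, b) - claimed) > 1e-9:
            return i
    return None


def build_tutor_prompt(problem: str, steps: List[str], error_index: Optional[int]) -> str:
    """Ground the tutor prompt in the verification result (hypothetical wording)."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    if error_index is None:
        verification = "No error was found in the student's steps."
    else:
        verification = (
            f"The first error is in step {error_index + 1}: {steps[error_index]}"
        )
    return (
        f"Problem: {problem}\n"
        f"Student solution:\n{numbered}\n"
        f"Verification: {verification}\n"
        "Respond as a tutor and address the verified error without revealing the answer."
    )


steps = ["3 * 4 = 12", "12 + 5 = 18", "18 - 2 = 16"]
error_index = first_error_step(steps)
prompt = build_tutor_prompt("Compute 3 * 4 + 5 - 2.", steps, error_index)
```

In this example the third step is arithmetically consistent with the wrong intermediate value, so only the second step is flagged. This mirrors the dataset's focus on the first error step rather than on every downstream consequence.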

