Large language models (LLMs) present an opportunity to scale high-quality personalized education to all. A promising approach toward this goal is to build dialog tutoring models that scaffold students' problem-solving. However, even though existing LLMs perform well at solving reasoning questions, they struggle to precisely detect students' errors and to tailor their feedback to those errors. Inspired by real-world teaching practice, in which teachers identify student errors and customize their responses accordingly, we focus on verifying student solutions and show how grounding in such verification improves the overall quality of tutor response generation. We collect a dataset of 1K stepwise math reasoning chains in which the first erroneous step is annotated by teachers. We show empirically that finding the mistake in a student solution is challenging for current models. We propose and evaluate several verifiers for detecting these errors. Using both automatic and human evaluation, we show that the student solution verifiers steer the generation model toward highly targeted responses to student errors, which are more often correct and contain fewer hallucinations than those of existing baselines.