Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.