Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.