Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier because of the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader that scores handwritten mathematical work against instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call, and we evaluate the system on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors (87\% for the best model) attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and the limitations of LLM-based grading for handwritten mathematics and provide guidance for system design, prompt refinement, and deployment in educational settings.