Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
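To make the "structured, rubric-guided prompting" concrete, the sketch below shows one plausible way to assemble such a prompt from an OCR transcript and a per-item rubric. The function and field names (`RubricItem`, `build_grading_prompt`) are illustrative assumptions, not the authors' implementation, and the actual model call is omitted.

```python
# Hypothetical sketch of rubric-guided prompt construction for OCR-conditioned
# grading; names and fields are illustrative, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str   # what the step must demonstrate, e.g. "applies the chain rule"
    points: float      # partial credit available for this step


def build_grading_prompt(problem: str, rubric: list[RubricItem], ocr_text: str) -> str:
    """Assemble a structured prompt that conditions the LLM on the OCR'd work
    and asks for per-item partial credit plus short formative feedback."""
    rubric_lines = "\n".join(
        f"{i + 1}. ({item.points} pts) {item.description}"
        for i, item in enumerate(rubric)
    )
    return (
        "You are grading a handwritten calculus solution transcribed by OCR; "
        "the transcription may contain recognition errors.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        f"Student work (OCR transcript):\n{ocr_text}\n\n"
        "For each rubric item, state the points awarded and a one-sentence "
        "justification, then give two sentences of formative feedback."
    )


if __name__ == "__main__":
    # Example usage with a made-up problem and rubric.
    rubric = [
        RubricItem("Sets up the derivative using the chain rule", 2.0),
        RubricItem("Simplifies correctly and states the final answer", 1.0),
    ]
    print(build_grading_prompt("Differentiate f(x) = sin(x^2).", rubric, "f'(x) = 2x cos(x^2)"))
```

One design point worth noting: conditioning the prompt on the rubric item by item, rather than asking for a single holistic score, is what makes partial-credit assessment and targeted formative feedback tractable in this setting.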