Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
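To make the "structured, rubric-guided prompting" concrete, the sketch below shows one plausible way to assemble such a prompt from an OCR transcript and a per-item rubric. The function and field names (`RubricItem`, `build_grading_prompt`) are illustrative assumptions, not the authors' implementation, and the actual model call is omitted.

```python
# Hypothetical sketch of rubric-guided prompt construction for OCR-conditioned
# grading; names and fields are illustrative, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str   # what the step must demonstrate, e.g. "applies the chain rule"
    points: float      # partial credit available for this step


def build_grading_prompt(problem: str, rubric: list[RubricItem], ocr_text: str) -> str:
    """Assemble a structured prompt that conditions the LLM on the OCR'd work
    and asks for per-item partial credit plus short formative feedback."""
    rubric_lines = "\n".join(
        f"{i + 1}. ({item.points} pts) {item.description}"
        for i, item in enumerate(rubric)
    )
    return (
        "You are grading a handwritten calculus solution transcribed by OCR; "
        "the transcription may contain recognition errors.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        f"Student work (OCR transcript):\n{ocr_text}\n\n"
        "For each rubric item, state the points awarded and a one-sentence "
        "justification, then give two sentences of formative feedback."
    )


if __name__ == "__main__":
    # Example usage with a made-up problem and rubric.
    rubric = [
        RubricItem("Sets up the derivative using the chain rule", 2.0),
        RubricItem("Simplifies correctly and states the final answer", 1.0),
    ]
    print(build_grading_prompt("Differentiate f(x) = sin(x^2).", rubric, "f'(x) = 2x cos(x^2)"))
```

One design point worth noting: conditioning the prompt on the rubric item by item, rather than asking for a single holistic score, is what makes partial-credit assessment and targeted formative feedback tractable in this setting.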