Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88% to 91% recognition accuracy, too low for unsupervised grading, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives (a wrong answer marked correct), and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
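
The distinction between false negatives and false positives can be made concrete with a small sketch. This is an illustration under stated assumptions, not the evaluation code behind the reported figures: the names `GradingErrors` and `count_errors` are hypothetical, and taking all answer positions as the denominator of each rate is an assumption, since the abstract does not state how the 0.58% is normalised.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GradingErrors:
    """Counts of recognition outcomes over all answer positions (hypothetical helper)."""
    total: int
    recognition_errors: int
    false_negatives: int
    false_positives: int

    @property
    def accuracy(self) -> float:
        # Share of positions where the recognised letter equals the written letter.
        return 1 - self.recognition_errors / self.total

    @property
    def false_negative_rate(self) -> float:
        # ASSUMPTION: normalised by all answer positions, not only by correct answers.
        return self.false_negatives / self.total

    @property
    def false_positive_rate(self) -> float:
        # ASSUMPTION: same denominator as the false-negative rate.
        return self.false_positives / self.total


def count_errors(
    written: Sequence[str],
    recognised: Sequence[str],
    reference: Sequence[str],
) -> GradingErrors:
    """Compare what the student wrote, what the model read, and the reference solution."""
    if not (len(written) == len(recognised) == len(reference)):
        raise ValueError("all three sequences must have the same length")
    if len(written) == 0:
        raise ValueError("at least one answer position is required")

    errors = false_neg = false_pos = 0
    for w, r, s in zip(written, recognised, reference):
        if w == r:
            continue  # read correctly, so the grade is unaffected
        errors += 1
        if w == s:
            false_neg += 1  # correct answer misread: the student loses credit
        elif r == s:
            false_pos += 1  # wrong answer misread as the solution: undeserved credit
        # otherwise the answer is wrong either way and the misreading changes nothing

    return GradingErrors(len(written), errors, false_neg, false_pos)


# Toy data for illustration only, not taken from the benchmark.
toy = count_errors("ABCDAB", "ABDDBC", "ABCDCC")
print(toy.accuracy, toy.false_negative_rate, toy.false_positive_rate)
```

Note that a recognition error only affects the grade when either the written or the recognised letter matches the reference solution, which is why the false-negative rate can sit well below the overall recognition error rate.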
