Grading handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a machine-readable table. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%--91% recognition accuracy, which is too low for unsupervised use, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives (a wrong answer marked correct), and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an example grading scheme only three of the 61 exams would be graded worse, and all three are caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
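The distinction between false negatives and false positives can be made concrete with a small sketch. It is illustrative only and is not the evaluation code behind the reported figures: the `Position` record, the `grading_errors` function, and the choice to divide by the total number of answer positions are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """One answer cell (hypothetical record layout, assumed for this sketch)."""
    truth: str       # letter the student actually wrote (human annotation)
    recognised: str  # letter returned by the model
    solution: str    # letter in the reference solution


def grading_errors(positions):
    """Return recognition accuracy and the two grading error rates.

    Assumption: all rates are taken over the total number of answer positions.
    """
    n = len(positions)
    if n == 0:
        raise ValueError("no answer positions")
    # Recognition accuracy: the model read the letter the student wrote.
    correct_reads = sum(p.recognised == p.truth for p in positions)
    # False negative: the student was right, but the reading marks it wrong.
    fn = sum(p.truth == p.solution and p.recognised != p.solution
             for p in positions)
    # False positive: the student was wrong, but the reading marks it right.
    fp = sum(p.truth != p.solution and p.recognised == p.solution
             for p in positions)
    return {
        "accuracy": correct_reads / n,
        "false_negative_rate": fn / n,
        "false_positive_rate": fp / n,
    }


example = [
    Position(truth="A", recognised="A", solution="A"),  # read correctly
    Position(truth="B", recognised="D", solution="B"),  # false negative
    Position(truth="C", recognised="A", solution="A"),  # false positive
    Position(truth="D", recognised="C", solution="A"),  # misread, grade unchanged
]
rates = grading_errors(example)
```

The last example position shows why recognition accuracy and fairness are separate questions: the letter is misread, but the answer was wrong either way, so the grade is unaffected.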