Providing timely, individualised feedback on handwritten student work is highly beneficial for learning but difficult to achieve at scale. This challenge has become more pressing as generative AI undermines the reliability of take-home assessments, shifting emphasis toward supervised, in-class evaluation. We present a scalable, end-to-end workflow for LLM-assisted grading of short, pen-and-paper assessments. The workflow spans (1) constructing solution keys, (2) developing detailed rubric-style grading keys that guide the LLM, and (3) a grading procedure that combines automated scanning and anonymisation, multi-pass LLM scoring, automated consistency checks, and mandatory human verification. We deploy the system in two undergraduate mathematics courses across six low-stakes in-class tests. Empirically, LLM assistance reduces grading time by approximately 23% while achieving inter-rater agreement comparable to, and in several cases tighter than, fully manual grading. Occasional model errors occur but are effectively contained by the hybrid design. Overall, our results show that carefully embedded human-in-the-loop LLM grading can substantially reduce grading workload while maintaining fairness and accuracy.