Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35–65% of responses at expert-level quality (QWK ≥ 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
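The calibration-and-routing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a classifier over discrete score labels, fits the temperature by a simple grid search over held-out negative log-likelihood, and accepts a prediction only when its calibrated confidence clears a threshold. All function names and parameter values here are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: dividing logits by T > 1 softens
    overconfident probabilities without changing the argmax."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 46)):
    """Post-hoc temperature scaling: choose T minimizing negative
    log-likelihood on a held-out validation set (illustrative grid search)."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = softmax(val_logits, T)
        nll = -np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

def route(logits, T, threshold=0.8):
    """Selective prediction: accept high-confidence predictions for
    automatic grading; defer the rest to human graders."""
    p = softmax(logits, T)
    pred = p.argmax(axis=-1)
    accepted = p.max(axis=-1) >= threshold
    return pred, accepted
```

In this setup, responses with `accepted == False` are routed to teachers, and their corrections can then be fed back as new training data, matching the continual-learning loop the abstract describes.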