When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: \textit{predicting when an LLM grader is likely to be correct}. This enables selective automation, in which high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (both short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency's ECE is 38\% higher even though it requires 5$\times$ the inference cost. Larger models exhibit substantially better calibration, though gains vary by dataset and method (e.g., a 28\% ECE reduction for self-reported confidence), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly skewed toward the top of the scale across methods, creating a ``confidence floor'' that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available \href{https://github.com/sonkar-lab/llm_grading_calibration}{here}.
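
To make the two central quantities concrete, the following is a minimal sketch of how Expected Calibration Error (ECE) and a confidence threshold for selective automation could be computed. It is not taken from the linked repository: the function names, the choice of 10 equal-width bins, and the toy data are all assumptions made for illustration, and the binning scheme used in the experiments may differ.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Group (confidence, correctness) pairs by equal-width bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def selective_automation(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) for predictions with confidence >= threshold.

    Predictions below the threshold are assumed to be routed to human review.
    """
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


if __name__ == "__main__":
    # Toy data, invented purely for illustration (not from the paper).
    toy_conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
    toy_correct = [True, True, True, False, True, False]
    print("ECE:", expected_calibration_error(toy_conf, toy_correct))
    print("coverage, accuracy:", selective_automation(toy_conf, toy_correct, 0.9))
```

Because confidence is top-skewed, a threshold set below the lowest value a grader ever reports would send every prediction to the automatic path. In practice, one would sweep `threshold` over the range of confidences actually observed and inspect the resulting coverage and accuracy trade-off.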
