Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: \textit{predicting when an LLM grader is likely to be correct}. This enables selective automation, in which high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (both short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency yields a 38\% higher ECE despite requiring 5$\times$ the inference cost. Larger models exhibit substantially better calibration, though gains vary by dataset and method (e.g., a 28\% ECE reduction for self-reported confidence), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that, for every method, confidence scores are strongly skewed toward the top of the scale, creating a ``confidence floor'' that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available \href{https://github.com/sonkar-lab/llm_grading_calibration}{here}.
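For readers unfamiliar with the calibration metric quoted above, the following is a sketch of the standard binned definition of expected calibration error (ECE). The number of bins $M$ and the equal-width binning of confidence scores are assumptions made for illustration; the exact binning used in the experiments is not stated here.

\[
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,
\left|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\right|,
\]

where $n$ is the number of graded responses, $B_m$ is the set of responses whose confidence falls in the $m$-th bin, $\mathrm{acc}(B_m)$ is the fraction of correct grading predictions in that bin, and $\mathrm{conf}(B_m)$ is the mean confidence in that bin. Lower values indicate better calibration. Under this reading, the 38\% gap between methods follows from the reported averages:

\[
\frac{0.229 - 0.166}{0.166} \;\approx\; 0.38 .
\]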