Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, the imperfect sensitivity and specificity of LLM judgments bias naive evaluation scores. We propose a simple plug-in framework that corrects this bias and constructs confidence intervals accounting for uncertainty from both the test dataset and a human-evaluated calibration dataset, enabling statistically sound and practical LLM-based evaluation. Building on this framework, we introduce an adaptive calibration strategy for constructing the calibration dataset so as to reduce uncertainty in the estimated score. Notably, we characterize the regimes in which LLM-based evaluation within our framework produces more reliable estimates than fully human evaluation. Moreover, our framework is more robust than existing approaches to distribution shift between the test and calibration datasets.
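To make the idea concrete, the sketch below shows one possible plug-in correction of the kind the abstract describes, assuming binary pass/fail judgments: sensitivity and specificity are estimated on a human-labeled calibration set, the naive LLM pass rate on the test set is then inverted (a Rogan-Gladen-style correction), and a bootstrap over both datasets yields a confidence interval. The function name, inputs, and the specific estimator are illustrative assumptions, not necessarily the paper's exact method.

```python
# Hypothetical sketch of a plug-in bias correction with a bootstrap CI.
# Not the paper's exact estimator; names and choices here are assumptions.
import numpy as np

def corrected_score(test_judgments, calib_judgments, calib_labels,
                    n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected pass rate with a bootstrap confidence interval.

    test_judgments  : 0/1 LLM judgments on the (unlabeled) test set
    calib_judgments : 0/1 LLM judgments on the human-labeled calibration set
    calib_labels    : 0/1 human labels on the calibration set
    """
    rng = np.random.default_rng(seed)
    test_judgments = np.asarray(test_judgments)
    calib_judgments = np.asarray(calib_judgments)
    calib_labels = np.asarray(calib_labels)

    def estimate(test_j, cal_j, cal_y):
        q = test_j.mean()                      # naive LLM pass rate on the test set
        sens = cal_j[cal_y == 1].mean()        # P(judge = 1 | human = 1)
        spec = 1.0 - cal_j[cal_y == 0].mean()  # P(judge = 0 | human = 0)
        # Invert q = sens * pi + (1 - spec) * (1 - pi) to recover the true rate pi.
        pi = (q - (1.0 - spec)) / (sens + spec - 1.0)
        return float(np.clip(pi, 0.0, 1.0))

    point = estimate(test_judgments, calib_judgments, calib_labels)

    # Bootstrap over both sources of uncertainty: test set and calibration set.
    boots = []
    n_test, n_cal = len(test_judgments), len(calib_judgments)
    for _ in range(n_boot):
        ti = rng.integers(0, n_test, n_test)
        ci = rng.integers(0, n_cal, n_cal)
        cal_j, cal_y = calib_judgments[ci], calib_labels[ci]
        if cal_y.min() == cal_y.max():         # resample lost a label class; skip it
            continue
        boots.append(estimate(test_judgments[ti], cal_j, cal_y))
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)
```

Under this reading, the "plug-in" step is the inversion inside `estimate`, and the interval reflects sampling noise in both the test judgments and the finite calibration set, which is what distinguishes the framework from reporting the naive LLM pass rate alone.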