Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, the imperfect sensitivity and specificity of LLM judges bias naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset, and it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize the parameter regimes, defined by the true evaluation score and the LLM judge's sensitivity and specificity, in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
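To make the plug-in idea concrete, the sketch below shows one natural instantiation rather than the paper's exact construction: the function name `corrected_score_ci`, the Rogan-Gladen-style correction, and the delta-method interval are our assumptions. It corrects a raw LLM pass rate using sensitivity and specificity estimated on a human-labeled calibration set, and returns a confidence interval that propagates uncertainty from both the test set and the calibration set.

```python
import numpy as np
from scipy.stats import norm

def corrected_score_ci(test_llm, calib_llm, calib_human, alpha=0.05):
    """Hypothetical plug-in correction of an LLM-judge pass rate (a sketch,
    not the paper's exact estimator).

    test_llm    : 0/1 LLM verdicts on the unlabeled test set
    calib_llm   : 0/1 LLM verdicts on the calibration set
    calib_human : 0/1 human labels on the same calibration examples
    """
    test_llm = np.asarray(test_llm, dtype=float)
    calib_llm = np.asarray(calib_llm, dtype=float)
    calib_human = np.asarray(calib_human, dtype=float)

    n = len(test_llm)
    p_hat = test_llm.mean()                    # raw (biased) LLM pass rate on the test set

    pos = calib_human == 1
    neg = ~pos
    sens = calib_llm[pos].mean()               # P(judge says pass | truly pass)
    spec = 1.0 - calib_llm[neg].mean()         # P(judge says fail | truly fail)

    D = sens + spec - 1.0                      # must be > 0 for an informative judge
    theta = (p_hat + spec - 1.0) / D           # Rogan-Gladen-style plug-in correction
    theta = float(np.clip(theta, 0.0, 1.0))

    # Delta-method variance combining test-set and calibration-set uncertainty:
    # Var(theta) ~ [Var(p) + theta^2 Var(sens) + (1 - theta)^2 Var(spec)] / D^2
    var_p = p_hat * (1.0 - p_hat) / n
    var_sens = sens * (1.0 - sens) / max(int(pos.sum()), 1)
    var_spec = spec * (1.0 - spec) / max(int(neg.sum()), 1)
    var_theta = (var_p + theta**2 * var_sens + (1.0 - theta)**2 * var_spec) / D**2

    z = norm.ppf(1.0 - alpha / 2.0)
    half = z * np.sqrt(var_theta)
    return theta, (max(0.0, theta - half), min(1.0, theta + half))
```

Because the interval width depends on the variances of both the sensitivity and specificity estimates, such a construction also suggests where extra calibration labels help most, which is the intuition behind the adaptive allocation strategy described above.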