Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
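The two mechanisms above can be illustrated with a minimal sketch. The aggregation rule (simple averaging of the two position-swapped preference probabilities) and the greedy threshold search are assumptions for illustration; they are not the paper's exact procedure, and the finite-sample conformal correction is omitted.

```python
import math

def bpe_uncertainty(p_ab: float, p_ba: float) -> float:
    """Bidirectional Preference Entropy (sketch).

    p_ab: judge's probability that response A wins when A is shown first.
    p_ba: judge's probability that A wins when the order is swapped.
    Averaging the two views yields an order-invariant preference
    probability; its binary entropy is the uncertainty score.
    (Averaging is an illustrative assumption, not the paper's rule.)
    """
    p = 0.5 * (p_ab + p_ba)  # order-invariant aggregate preference
    if p <= 0.0 or p >= 1.0:
        return 0.0  # fully confident either way: zero entropy
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def calibrate_threshold(uncertainties, correct, alpha):
    """Simplified selective calibration: accept judgments with
    uncertainty <= threshold, choosing the largest threshold whose
    accepted set keeps empirical error <= alpha on calibration data."""
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    best, errors = 0.0, 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        if errors / k <= alpha:  # risk among the k most confident items
            best = uncertainties[i]
    return best
```

A judgment at test time is then accepted only if its BPE score falls at or below the calibrated threshold; otherwise the judge abstains.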