Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response orderings, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining high coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$) while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
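The two ingredients described above, an order-invariant entropy score and a risk-calibrated acceptance threshold, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the aggregation is assumed to be a simple average of the two positional runs, and the calibration shown is a simplified empirical-risk scan rather than the paper's finite-sample conformal correction.

```python
import math

def bpe_uncertainty(p_ab: float, p_ba: float) -> float:
    """Bidirectional Preference Entropy (sketch).

    p_ab: judge's probability that response A wins when A is shown first.
    p_ba: judge's probability that A wins when B is shown first.
    Averaging the two runs (an assumed aggregation rule) makes the score
    invariant to response order; binary entropy of the aggregate is the
    uncertainty.
    """
    p = 0.5 * (p_ab + p_ba)  # order-invariant preference probability
    if p in (0.0, 1.0):
        return 0.0           # fully confident judgment has zero entropy
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def calibrate_threshold(scores, errors, alpha=0.10):
    """Return the largest uncertainty threshold whose empirical error rate
    among accepted (non-abstained) calibration judgments stays <= alpha.

    scores: BPE uncertainty per calibration pair (lower = more confident).
    errors: 1 if the judge's verdict disagrees with the reference label.
    Simplified sketch; no finite-sample correction is applied here.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    best, err_count, accepted = 0.0, 0, 0
    for i in order:  # admit pairs from most to least confident
        accepted += 1
        err_count += errors[i]
        if err_count / accepted <= alpha:
            best = scores[i]
    return best
```

At test time, a judgment is accepted only when its BPE score falls at or below the calibrated threshold; otherwise the judge abstains.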