Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. Under a misspecified aggregation scheme, collecting more data can make an evaluation more confidently wrong. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model with judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
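The judge-aware extension described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes the parameterization P(i beats j | judge k) = sigmoid(beta_k * (theta_i - theta_j)), where theta is latent model quality and beta_k is judge k's discrimination, and it imposes the (assumed) normalizations sum(theta) = 0 and mean(beta) = 1 for identifiability. The MLE is fit here by plain gradient ascent; the function and variable names are illustrative.

```python
import numpy as np

def judge_aware_btl_mle(comparisons, n_models, n_judges,
                        lr=0.05, n_iters=2000, seed=0):
    """Jointly estimate model qualities (theta) and judge reliabilities (beta).

    comparisons: list of (winner, loser, judge) index triples.
    Assumed model: P(winner beats loser | judge k) = sigmoid(beta_k * (theta_w - theta_l)).
    Normalizations (assumed): sum(theta) = 0, mean(beta) = 1.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 0.01, n_models)  # latent quality scores
    beta = np.ones(n_judges)                 # judge discrimination parameters
    W = np.array([c[0] for c in comparisons])
    L = np.array([c[1] for c in comparisons])
    K = np.array([c[2] for c in comparisons])
    n = len(comparisons)
    for _ in range(n_iters):
        diff = theta[W] - theta[L]
        p = 1.0 / (1.0 + np.exp(-beta[K] * diff))  # prob. of the observed outcome
        resid = 1.0 - p                            # per-comparison gradient weight
        # Mean log-likelihood gradients.
        g_theta = np.zeros(n_models)
        np.add.at(g_theta, W, beta[K] * resid)
        np.add.at(g_theta, L, -beta[K] * resid)
        g_beta = np.zeros(n_judges)
        np.add.at(g_beta, K, diff * resid)
        theta += lr * g_theta / n
        beta += lr * g_beta / n
        # Re-impose the identifiability normalizations after each step.
        theta -= theta.mean()
        beta /= beta.mean()
    return theta, beta

# Synthetic demo: three models, one discriminating judge and one noisy judge.
rng = np.random.default_rng(1)
true_theta = np.array([1.0, 0.0, -1.0])
true_beta = np.array([2.0, 0.5])  # judge 0 is reliable, judge 1 is near-random
comps = []
for _ in range(4000):
    i, j = rng.choice(3, size=2, replace=False)
    k = int(rng.integers(2))
    p_ij = 1.0 / (1.0 + np.exp(-true_beta[k] * (true_theta[i] - true_theta[j])))
    comps.append((i, j, k) if rng.random() < p_ij else (j, i, k))

theta_hat, beta_hat = judge_aware_btl_mle(comps, n_models=3, n_judges=2)
```

On this synthetic data the estimated theta recovers the true ranking and the noisy judge receives a smaller discrimination weight, so its votes count for less in the aggregate, which is the intended behavior of the judge-aware weighting.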