A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
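The abstract does not state the likelihood explicitly. As a rough illustration only, the sketch below assumes that judge k prefers model i over model j with probability sigmoid(gamma_k * (theta_i - theta_j)), where theta holds the latent model scores and gamma_k > 0 is the judge-specific discrimination parameter. The normalizations (scores sum to zero, geometric mean of the discriminations equals one), the plain gradient-ascent optimizer, the simulated data, and all function names are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(theta, gamma, n, rng):
    # Draw n comparisons (i, j, k, y); y = 1 means judge k preferred model i over j.
    data = []
    for _ in range(n):
        i, j = rng.sample(range(len(theta)), 2)
        k = rng.randrange(len(gamma))
        p = sigmoid(gamma[k] * (theta[i] - theta[j]))
        data.append((i, j, k, 1 if rng.random() < p else 0))
    return data


def log_likelihood(theta, gamma, data):
    total = 0.0
    for i, j, k, y in data:
        z = gamma[k] * (theta[i] - theta[j])
        if not y:
            z = -z
        # log sigmoid(z), written to avoid overflow for large |z|.
        total -= max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total


def fit(data, n_models, n_judges, lr=0.5, steps=600):
    # Gradient ascent on the average log-likelihood (a stand-in optimizer).
    theta = [0.0] * n_models
    log_gamma = [0.0] * n_judges  # gamma = exp(log_gamma) stays positive
    n = len(data)
    for _ in range(steps):
        g_theta = [0.0] * n_models
        g_lg = [0.0] * n_judges
        for i, j, k, y in data:
            gam = math.exp(log_gamma[k])
            diff = theta[i] - theta[j]
            r = y - sigmoid(gam * diff)  # residual: outcome minus predicted probability
            g_theta[i] += gam * r
            g_theta[j] -= gam * r
            g_lg[k] += gam * diff * r
        theta = [t + lr * g / n for t, g in zip(theta, g_theta)]
        log_gamma = [a + lr * g / n for a, g in zip(log_gamma, g_lg)]
        # Assumed normalizations: center theta, and move the common scale of
        # gamma into theta so that gamma_k * (theta_i - theta_j) is unchanged.
        mean_t = sum(theta) / n_models
        shift = sum(log_gamma) / n_judges
        scale = math.exp(shift)
        theta = [(t - mean_t) * scale for t in theta]
        log_gamma = [a - shift for a in log_gamma]
    return theta, [math.exp(a) for a in log_gamma]


# Hypothetical setup: four models, one sharp judge and one noisy judge.
rng = random.Random(0)
true_theta = [1.0, 0.3, -0.3, -1.0]
true_gamma = [3.0, 0.3]
data = simulate(true_theta, true_gamma, 1500, rng)
theta_hat, gamma_hat = fit(data, n_models=4, n_judges=2)
print("estimated scores:", [round(t, 2) for t in theta_hat])
print("estimated judge discriminations:", [round(g, 2) for g in gamma_hat])
```

Under this assumed parameterization, a judge with a small gamma contributes nearly uninformative comparisons, so its votes are automatically down-weighted relative to an unweighted Bradley-Terry-Luce fit. Confidence intervals of the kind the abstract describes would additionally require the observed information matrix, which this sketch does not compute.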

