Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. Under a misspecified aggregation scheme, collecting more data can make an evaluation more confidently wrong. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model with judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
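The judge-aware extension described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes the parameterization P(i beats j | judge k) = sigmoid(beta_k * (theta_i - theta_j)), where theta is latent model quality and beta_k is judge k's discrimination, and it imposes the (assumed) normalizations sum(theta) = 0 and mean(beta) = 1 for identifiability. The MLE is fit here by plain gradient ascent; the function and variable names are illustrative.

```python
import numpy as np

def judge_aware_btl_mle(comparisons, n_models, n_judges,
                        lr=0.05, n_iters=2000, seed=0):
    """Jointly estimate model qualities (theta) and judge reliabilities (beta).

    comparisons: list of (winner, loser, judge) index triples.
    Assumed model: P(winner beats loser | judge k) = sigmoid(beta_k * (theta_w - theta_l)).
    Normalizations (assumed): sum(theta) = 0, mean(beta) = 1.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 0.01, n_models)  # latent quality scores
    beta = np.ones(n_judges)                 # judge discrimination parameters
    W = np.array([c[0] for c in comparisons])
    L = np.array([c[1] for c in comparisons])
    K = np.array([c[2] for c in comparisons])
    n = len(comparisons)
    for _ in range(n_iters):
        diff = theta[W] - theta[L]
        p = 1.0 / (1.0 + np.exp(-beta[K] * diff))  # prob. of the observed outcome
        resid = 1.0 - p                            # per-comparison gradient weight
        # Mean log-likelihood gradients.
        g_theta = np.zeros(n_models)
        np.add.at(g_theta, W, beta[K] * resid)
        np.add.at(g_theta, L, -beta[K] * resid)
        g_beta = np.zeros(n_judges)
        np.add.at(g_beta, K, diff * resid)
        theta += lr * g_theta / n
        beta += lr * g_beta / n
        # Re-impose the identifiability normalizations after each step.
        theta -= theta.mean()
        beta /= beta.mean()
    return theta, beta

# Synthetic demo: three models, one discriminating judge and one noisy judge.
rng = np.random.default_rng(1)
true_theta = np.array([1.0, 0.0, -1.0])
true_beta = np.array([2.0, 0.5])  # judge 0 is reliable, judge 1 is near-random
comps = []
for _ in range(4000):
    i, j = rng.choice(3, size=2, replace=False)
    k = int(rng.integers(2))
    p_ij = 1.0 / (1.0 + np.exp(-true_beta[k] * (true_theta[i] - true_theta[j])))
    comps.append((i, j, k) if rng.random() < p_ij else (j, i, k))

theta_hat, beta_hat = judge_aware_btl_mle(comps, n_models=3, n_judges=2)
```

On this synthetic data the estimated theta recovers the true ranking and the noisy judge receives a smaller discrimination weight, so its votes count for less in the aggregate, which is the intended behavior of the judge-aware weighting.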