Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
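To make the idea of judge-specific discrimination concrete, here is a minimal sketch of one plausible parameterization. The abstract does not state the exact likelihood, so the form below is an assumption: the probability that model i beats model j under judge k is taken to be sigmoid(gamma_k * (theta_i - theta_j)), where theta holds latent model quality and gamma holds judge discrimination. The function names and the particular normalization (scores sum to zero, discriminations average to one) are likewise illustrative choices, not the authors' definitions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def win_prob(theta, gamma, i, j, k):
    """Assumed judge-aware BTL: P(model i beats model j | judge k).

    gamma[k] = 0 means judge k is pure noise (probability 0.5);
    larger gamma[k] means the judge tracks quality gaps more sharply.
    """
    return sigmoid(gamma[k] * (theta[i] - theta[j]))


def log_likelihood(theta, gamma, comparisons):
    """Log-likelihood of (winner, loser, judge) triples."""
    return sum(math.log(win_prob(theta, gamma, w, l, k))
               for w, l, k in comparisons)


def normalize(theta, gamma):
    """Pin down the shift and scale that the likelihood cannot identify.

    Shifting all theta by a constant, or rescaling theta by c while
    dividing gamma by c, leaves every win probability unchanged.
    Assumes the gamma values have a positive mean.
    """
    mean_theta = sum(theta) / len(theta)
    centered = [t - mean_theta for t in theta]
    c = sum(gamma) / len(gamma)
    return [t * c for t in centered], [g / c for g in gamma]


# Hypothetical toy data: three models, two judges.
theta = [1.0, 0.0, -0.5]
gamma = [2.0, 0.5]
comparisons = [(0, 1, 0), (1, 2, 1), (2, 0, 1)]

theta_n, gamma_n = normalize(theta, gamma)
```

Under this assumed form, `log_likelihood(theta_n, gamma_n, comparisons)` equals `log_likelihood(theta, gamma, comparisons)`, which illustrates why the parameters can only be identified up to such normalizations. Setting every gamma to the same value recovers the standard, unweighted Bradley-Terry-Luce model.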