Evaluating new large language models typically requires costly, large-scale human annotation campaigns. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors, such as position bias, self-preference, and intransitivity, that can strongly miscalibrate the resulting rankings. We quantify this judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences and propagate calibrated win probabilities, rather than hard labels, into the Bradley-Terry procedure. This step alone substantially improves Elo estimation accuracy, reducing the mean absolute error (MAE) between LLM-derived and human-derived ratings to 17.9 Elo points, averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that gives developers calibrated Elo estimates and honest uncertainty bounds without requiring large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
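To make the local layer concrete, the following is a minimal sketch of a Bradley-Terry fit that consumes soft win probabilities instead of hard labels. It is an illustration under stated assumptions, not the released implementation: the function name `fit_soft_bradley_terry`, the `(i, j, p)` battle format, the minorization-maximization (MM) update, and the 400-point scale centered at 1500 are all choices made here for illustration. It also assumes every probability lies strictly between 0 and 1 and that every model appears in at least one battle.

```python
import math

def fit_soft_bradley_terry(battles, n_models, iters=500, scale=400.0, base=1500.0):
    """Fit Bradley-Terry ratings from soft outcomes.

    battles: list of (i, j, p), where p is the judge's calibrated
             probability that model i beats model j (hypothetical format).
    Returns one Elo-style rating per model.
    """
    # w[i][j] accumulates the probability mass of "i beats j".
    w = [[0.0] * n_models for _ in range(n_models)]
    for i, j, p in battles:
        w[i][j] += p
        w[j][i] += 1.0 - p

    strength = [1.0] * n_models
    for _ in range(iters):
        updated = []
        for i in range(n_models):
            soft_wins = sum(w[i])
            denom = sum(
                (w[i][j] + w[j][i]) / (strength[i] + strength[j])
                for j in range(n_models) if j != i
            )
            # MM update; the floor guards against log(0) below.
            updated.append(max(soft_wins / denom, 1e-12) if denom > 0 else strength[i])
        # Fix the scale by normalizing the geometric mean to 1.
        log_mean = sum(math.log(s) for s in updated) / n_models
        strength = [s / math.exp(log_mean) for s in updated]

    # Map strengths to an Elo-like scale (assumed convention).
    return [base + scale * math.log10(s) for s in strength]
```

With hard labels, each battle contributes a full win to one side; here a battle with `p = 0.75` contributes 0.75 of a win to one model and 0.25 to the other, so uncertain judgments move the ratings less than confident ones.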
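The global layer can be sketched in the same spirit. The snippet below applies split conformal prediction to a calibration set of models for which both LLM-derived and human-derived Elo ratings are known, then wraps a new model's LLM-derived rating in a prediction interval. The function name `conformal_elo_interval`, its arguments, and the use of the absolute residual as the nonconformity score are assumptions made for illustration; the paper may define the score differently.

```python
import math

def conformal_elo_interval(cal_llm, cal_human, new_llm, alpha=0.1):
    """Split conformal interval for a model's human-derived Elo.

    cal_llm, cal_human: LLM-derived and human-derived Elo ratings of the
                        calibration models (hypothetical inputs).
    new_llm: LLM-derived Elo of the model being evaluated.
    alpha: target miscoverage level; coverage is at least 1 - alpha.
    """
    # Nonconformity score: absolute gap between the two ratings (assumed).
    scores = sorted(abs(h - l) for l, h in zip(cal_llm, cal_human))
    n = len(scores)
    # Finite-sample corrected rank of the conformal quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration models the interval is unbounded.
    q = math.inf if k > n else scores[k - 1]
    return new_llm - q, new_llm + q
```

Under exchangeability of the calibration and test models, the returned interval contains the human-derived rating with probability at least `1 - alpha`, marginally over the draw of models. The interval has constant width, which reflects the choice of an unnormalized absolute residual as the score.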