Large language models (LLMs) are now widely used to evaluate the quality of text, a setting commonly referred to as LLM-as-a-judge. While prior work has mainly focused on point-wise and pair-wise evaluation paradigms, rubric-based evaluation, where an LLM selects a score from multiple rubric descriptions, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multiple-choice setting and therefore exhibits position bias: LLMs prefer score options that appear at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate that this position bias is consistent. To mitigate it, we propose a balanced permutation strategy that distributes each score option evenly across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias but also improves correlation between the LLM judge and human judgments. Our results suggest that rubric-based LLM-as-a-judge evaluation is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.
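The balanced permutation strategy can be sketched as follows. This is a minimal illustration, not the paper's exact construction: it assumes a hypothetical `judge` callable that returns the score an LLM picks given one ordering of the rubric options, and it builds orderings via cyclic rotations (a Latin square), so each score option occupies each list position exactly once before the scores are averaged.

```python
def balanced_permutations(options):
    """Cyclic rotations of the option list: a Latin square in which
    every option appears exactly once at every position."""
    n = len(options)
    return [[options[(i + j) % n] for j in range(n)] for i in range(n)]

def aggregate_score(judge, item, options):
    """Query the judge once per balanced ordering and average the
    chosen scores, canceling out any fixed positional preference."""
    orderings = balanced_permutations(options)
    scores = [judge(item, order) for order in orderings]
    return sum(scores) / len(scores)
```

For intuition, a maximally biased judge that always picks whichever option is listed first would, under this aggregation, return the mean of all score options rather than a position-dependent extreme, since each option is listed first in exactly one of the balanced orderings.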