Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We identify a core methodological confound whose removal could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver apparently self-preferring verdicts on queries they themselves answered incorrectly, regardless of whether either candidate response is actually their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for its own response against the probability that it votes for an incorrect response from another model. When we apply this simple baseline to 37,448 queries, only 51% of the initial findings retain statistical significance. Finally, we characterize the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from proposed solutions. More broadly, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
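The Evaluator Quality Baseline described above can be illustrated with a minimal sketch. The function below, assuming a hypothetical per-vote record format (the field names `chosen_is_self` and `chosen_is_incorrect` are illustrative, not from the paper), contrasts the rate at which a judge votes for its own incorrect response with the rate at which it votes for another model's incorrect response:

```python
def evaluator_quality_baseline(votes):
    """Sketch of the Evaluator Quality Baseline comparison.

    votes: list of dicts, one per judging decision, with keys:
      'chosen_is_self'      (bool) -- the response the judge voted for was its own
      'chosen_is_incorrect' (bool) -- that chosen response was factually wrong
    Returns (p_self_wrong, p_other_wrong): the probability of voting for an
    incorrect response, split by whether the chosen response was the judge's own.
    """
    self_votes = [v for v in votes if v["chosen_is_self"]]
    other_votes = [v for v in votes if not v["chosen_is_self"]]
    # Guard against empty strata to avoid division by zero.
    p_self = (sum(v["chosen_is_incorrect"] for v in self_votes) / len(self_votes)
              if self_votes else 0.0)
    p_other = (sum(v["chosen_is_incorrect"] for v in other_votes) / len(other_votes)
               if other_votes else 0.0)
    return p_self, p_other
```

If the two rates are comparable, apparent "self-preference" on hard queries may simply reflect low evaluator quality rather than narcissism, which is the confound the baseline is designed to expose.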