The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but these judges exhibit a notable bias toward longer responses, undermining the reliability of such evaluations. To better understand this bias, we propose decomposing the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. The former is length-independent and relates to trustworthiness attributes such as correctness, toxicity, and consistency; the latter is length-dependent and represents the amount of information in the response. We empirically validate this decomposition through controlled experiments and find that response length affects evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win-rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses within equivalent length intervals.
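The length-interval alignment described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the bucket width, the `judge` callback, and all function names are assumptions introduced here. The idea is simply that a win is counted only when the test and reference responses fall in the same length interval.

```python
# Hedged sketch of length-interval-matched win-rate computation.
# All names (length_bucket, adjusted_win_rate, judge) are hypothetical;
# the real AdapAlpaca procedure may differ in bucketing and judging details.

def length_bucket(text: str, width: int = 100) -> int:
    """Assign a response to a length interval `width` words wide."""
    return len(text.split()) // width

def adjusted_win_rate(pairs, judge, width: int = 100) -> float:
    """Win rate of the test model over pairs whose test and reference
    responses fall in the same length interval.

    pairs: iterable of (prompt, test_response, reference_response)
    judge: callable returning True if the test response wins
    """
    wins = total = 0
    for prompt, test_resp, ref_resp in pairs:
        if length_bucket(test_resp, width) != length_bucket(ref_resp, width):
            continue  # skip length-mismatched pairs to avoid length confounding
        total += 1
        if judge(prompt, test_resp, ref_resp):
            wins += 1
    return wins / total if total else float("nan")
```

In practice the judge would be an LLM preference call; restricting comparisons to matched length intervals removes the information-mass advantage that longer responses would otherwise carry.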