``LLM-as-a-judge,'' which utilizes large language models (LLMs) as evaluators, has proven effective in many evaluation tasks. However, evaluator LLMs exhibit numerical bias, a phenomenon where certain evaluation scores are generated disproportionately often, leading to reduced evaluation performance. This study investigates the cause of this bias. Given that most evaluator LLMs are aligned through instruction tuning and preference tuning, and that prior research suggests alignment reduces output diversity, we hypothesize that numerical bias arises from alignment. To test this, we compare outputs from pre- and post-alignment LLMs and observe that alignment indeed increases numerical bias. We also explore mitigation strategies for post-alignment LLMs, including temperature scaling, distribution calibration, and score range adjustment. Among these, score range adjustment is the most effective in reducing bias and improving performance, though it remains heuristic. Our findings highlight the need for further work on optimal score range selection and more robust mitigation strategies.
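As a rough illustration of one mitigation strategy named above, the sketch below applies temperature scaling to a hypothetical distribution over score tokens. The logits and score range are invented for the example and are not from the paper; the point is only that raising the softmax temperature flattens an over-concentrated score distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by a temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits an evaluator LLM might assign to score tokens
# "1".."5", with an exaggerated preference for the middle score.
score_logits = [0.2, 0.5, 3.0, 1.0, 0.3]

p_default = softmax(score_logits, temperature=1.0)
p_flattened = softmax(score_logits, temperature=2.0)

# Higher temperature spreads probability mass more evenly, so the
# dominant score receives a smaller share of the mass.
print(max(p_default) > max(p_flattened))
```

This only reshapes the sampling distribution at a fixed set of logits; as the abstract notes, such calibration was less effective in the authors' experiments than adjusting the score range itself.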