Employing Large Language Models (LLMs) to assess the quality of generated responses, whether by prompting instruction-tuned models or by fine-tuning dedicated judge models, has become a widely adopted evaluation method. Such evaluators are known to be vulnerable to biases, such as favoring longer responses. While it is important to overcome this problem, the specifics of these biases remain under-explored. In this work, we qualitatively identify six types of biases inherent in various judge models. We propose EvalBiasBench, a meta-evaluation collection of hand-crafted test cases for each bias type. Additionally, we present de-biasing dataset construction methods and the associated preference dataset, OffsetBias. Experimental results demonstrate that fine-tuning on our dataset significantly enhances the robustness of judge models against biases and improves performance across most evaluation scenarios. We release our datasets and the fine-tuned judge model to the public.