Employing Large Language Models (LLMs) to assess the quality of generated responses, such as prompting instruct-tuned models or fine-tuning judge models, has become a widely adopted evaluation method. However, such evaluators are known to be vulnerable to biases, such as favoring longer responses. While it is important to overcome this problem, the specifics of these biases remain under-explored. In this work, we qualitatively identify six types of bias inherent in various judge models. We propose EvalBiasBench, a meta-evaluation collection of hand-crafted test cases for each bias type. Additionally, we present de-biasing dataset construction methods and the associated preference dataset, OffsetBias. Experimental results demonstrate that fine-tuning on our dataset significantly enhances the robustness of judge models against biases and improves performance across most evaluation scenarios. We release our datasets and the fine-tuned judge model to the public.
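To make the judge-model setup concrete, below is a minimal sketch of pairwise LLM-as-a-judge evaluation in Python. The prompt wording, the `call_llm` callable, and the verdict parsing are illustrative assumptions, not the paper's actual protocol; judging each pair in both orders is a common mitigation for position bias in the LLM-as-a-judge literature, shown here alongside the length bias the abstract mentions.

```python
from typing import Callable

# Illustrative judge prompt; the actual prompt used in the paper may differ.
JUDGE_PROMPT = """You are an impartial judge. Compare the two responses
to the instruction and decide which one is better.

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}

Answer with exactly one letter: "A" or "B"."""


def parse_verdict(text: str) -> str:
    """Extract the judge's letter verdict from its raw output."""
    text = text.strip().upper()
    return "A" if text.startswith("A") else "B"


def judge_pair(call_llm: Callable[[str], str],
               instruction: str,
               response_a: str,
               response_b: str) -> str:
    """Return 'A', 'B', or 'tie' according to the judge model's verdict.

    `call_llm` is a hypothetical helper that sends a prompt to an
    instruct-tuned model (e.g. via any chat-completion client) and
    returns its text output. The pair is judged in both orders to
    offset position bias; disagreement counts as a tie.
    """
    v1 = parse_verdict(call_llm(JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_a, response_b=response_b)))
    v2 = parse_verdict(call_llm(JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_b, response_b=response_a)))
    v2 = "A" if v2 == "B" else "B"  # map the swapped verdict back to original labels
    return v1 if v1 == v2 else "tie"
```

In this sketch, a judge that systematically favors longer responses would pick the same (longer) response in both orders, which is exactly the kind of failure that hand-crafted bias test cases are designed to surface.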