Recently, evaluating the output quality of Large Language Models (LLMs) with another LLM, commonly known as LLM-as-a-Judge, has become a popular paradigm. However, existing judges have been shown to be biased: they favor answers with better superficial quality (e.g., verbosity and fluency) while overlooking instruction-following ability. In this work, we present a systematic study of the bias of LLM-as-a-Judge. For closed-source judge models, we apply calibration, at both the probability level and the prompt level, to reduce the influence of superficial quality. For open-source judge models, we mitigate the bias through contrastive training with curated negative samples that deviate from the instruction yet exhibit better superficial quality. Experiments on a bias evaluation benchmark show that our methods reduce the bias by a large margin while maintaining satisfactory evaluation accuracy.
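To make the probability-level calibration concrete, here is a minimal sketch of one plausible realization, assuming the judge exposes logits over candidate ratings: the judge is probed twice, once with the real instruction and once with a content-free instruction that leaves only superficial quality to score, and the superficial prior is divided out (the function names and the two-probe setup are illustrative assumptions, not the paper's exact procedure).

```python
import math

def softmax(logits):
    """Convert a list of logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def calibrate(rating_logits, bias_logits):
    """Probability-level calibration sketch.

    rating_logits: judge logits over ratings given (instruction, answer).
    bias_logits:   judge logits over ratings given a content-free
                   instruction and the same answer -- a hypothetical probe
                   capturing the judge's preference for superficial quality.
    Returns the rating distribution with the superficial prior divided out.
    """
    p = softmax(rating_logits)
    q = softmax(bias_logits)
    # Divide out the superficial prior, then renormalize.
    w = [pi / qi for pi, qi in zip(p, q)]
    z = sum(w)
    return [wi / z for wi in w]
```

If the two probes agree (the judge's preference is explained entirely by superficial quality), the calibrated distribution becomes uniform, i.e. the judge expresses no residual preference.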