Recent advances in Large Language Models (LLMs) have spurred the development of LLM-as-a-judge, an application in which an LLM judges the quality of a piece of text given its context. However, previous studies have demonstrated that LLM-as-a-judge can be biased toward various aspects of the judged texts, often in ways that do not align with human preferences. One identified bias is language bias: the decision of LLM-as-a-judge can differ depending on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity across languages when the judge is prompted to compare options written in the same language, and (2) bias toward options written in major languages when the judge is prompted to compare options written in two different languages. For same-language judging, we find significant performance disparities across language families, with European languages consistently outperforming African languages; this bias is more pronounced in culturally related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by the language of the answer than by the language of the question. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity alone.
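The final finding above refers to low-perplexity bias, i.e., a tendency of LLM judges to prefer text the model itself finds more predictable. As background, here is a minimal sketch of how perplexity is computed from per-token log-probabilities; the function name and the toy probability values are illustrative assumptions, not taken from the paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token log-probabilities (natural log).

    PPL = exp(-(1/N) * sum_i log p_i); lower values mean the model
    assigns the text higher likelihood (finds it more predictable).
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical 4-token answers: one the model finds likely, one it finds unlikely.
likely = [math.log(p) for p in (0.9, 0.8, 0.9, 0.7)]
unlikely = [math.log(p) for p in (0.3, 0.2, 0.4, 0.1)]

# A judge exhibiting low-perplexity bias would tend to prefer the first answer.
print(perplexity(likely) < perplexity(unlikely))  # True
```

In practice the per-token log-probabilities would come from the judge model itself (or a proxy language model) scored over the candidate answer.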