With Retrieval-Augmented Generation (RAG), Large Language Models (LLMs) are playing a pivotal role in information search and are being adopted globally. Although the multilingual capability of LLMs offers new opportunities to bridge language barriers, does this capability hold up in real-life scenarios where linguistic divides and knowledge conflicts between multilingual sources are known to occur? In this paper, we study LLMs' linguistic preferences in a RAG-based information search setting. We find that LLMs display a systemic bias towards information in the same language as the query in both information retrieval and answer generation. Furthermore, when little information is available in the query language, LLMs prefer documents in high-resource languages, reinforcing dominant views. This bias exists for both factual and opinion-based queries. Our results highlight the linguistic divide within multilingual LLMs in information search systems. The seemingly beneficial multilingual capability of LLMs may backfire on information parity by reinforcing language-specific information cocoons or filter bubbles, further marginalizing low-resource views.