Because they are trained on text in many languages, large language models (LLMs) typically offer multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages. That is, LLMs struggle to respond consistently when the same task is posed in different languages. In this study, we first examine the consistency of LLM outputs for queries in various languages from two aspects: safety and quality. We conduct this analysis on two datasets (AdvBench and NQ) with four LLMs (Llama2-13b, Gemma-7b, GPT-3.5-turbo and Gemini-pro). The results show that LLMs exhibit stronger human alignment for queries in English, French, Russian, and Spanish (on average, only 1.04\% of harmful queries jailbreak the models successfully) than for queries in Bengali, Georgian, Nepali and Maithili (on average, 27.7\% of harmful queries jailbreak the models successfully). Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce higher-quality responses (an average $F_1$ score of 0.1494) than for the other languages. Building on these findings, we propose LDFighter, a similarity-based voting method, to mitigate linguistic discrimination in LLMs. LDFighter ensures consistent service for speakers of different languages. We evaluate LDFighter with both benign and harmful queries. The results show that LDFighter not only significantly reduces the jailbreak success rate but also improves response quality on average, demonstrating its effectiveness.
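The abstract describes LDFighter only as similarity-based voting, so the following is a minimal sketch of that general idea, not the authors' implementation. It assumes that the same query has already been answered in several languages and that each answer has been translated into a common pivot language. The translation step is omitted here. The function names `token_f1` and `vote_by_similarity` are hypothetical, and token-overlap $F_1$ is a stand-in for whatever similarity measure the method actually uses.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (an assumed, stand-in similarity)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def vote_by_similarity(responses: list[str]) -> str:
    """Return the response with the highest mean similarity to all others.

    Assumption: `responses` are answers to one query posed in different
    languages, already translated into a shared pivot language.
    """
    if not responses:
        raise ValueError("need at least one response")
    if len(responses) == 1:
        return responses[0]

    def mean_similarity(i: int) -> float:
        others = [r for j, r in enumerate(responses) if j != i]
        return sum(token_f1(responses[i], r) for r in others) / len(others)

    # Ties are broken in favour of the earliest response.
    best_index = max(range(len(responses)), key=mean_similarity)
    return responses[best_index]


if __name__ == "__main__":
    candidates = [
        "I cannot help with that request",
        "Sorry, I cannot help with that",
        "Sure, here are the steps",
    ]
    print(vote_by_similarity(candidates))
```

In this toy input, the two refusals agree with each other and the outlier does not, so the vote selects a refusal. The intuition is that an answer produced only in a low-resource language, where alignment is weaker, is outvoted by the answers that agree across languages.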