We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and that even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation, at https://github.com/for-ai/language-confusion.