We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion, and that even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT, and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation, at https://github.com/for-ai/language-confusion.