Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are intended for use in English only (e.g., Llama2, Mistral) or in a small handful of high-resource languages (e.g., Mixtral, Qwen). Recent research shows that, despite these limits on intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver-standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e., whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.
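To make the two evaluation quantities concrete, the following is a minimal sketch of how per-language fidelity and accuracy, and accuracy split by faithfulness, could be tallied. It is an illustration under stated assumptions, not the benchmark's actual evaluation code: the record fields `prompt_lang`, `response_lang`, and `correct`, and both function names, are hypothetical, and the sketch assumes that the response language and answer correctness have already been determined by some upstream step.

```python
from collections import defaultdict


def summarize(records):
    """Per-language fidelity and accuracy.

    `records` is an iterable of dicts with the hypothetical keys
    'prompt_lang', 'response_lang' and 'correct' (bool).
    """
    per_lang = defaultdict(lambda: {"n": 0, "faithful": 0, "correct": 0})
    for r in records:
        stats = per_lang[r["prompt_lang"]]
        stats["n"] += 1
        # A response counts as faithful if it is in the prompted language.
        stats["faithful"] += r["response_lang"] == r["prompt_lang"]
        stats["correct"] += bool(r["correct"])
    return {
        lang: {
            "fidelity": s["faithful"] / s["n"],
            "accuracy": s["correct"] / s["n"],
        }
        for lang, s in per_lang.items()
    }


def accuracy_by_faithfulness(records):
    """Accuracy over faithful (True) and unfaithful (False) responses."""
    buckets = {True: [0, 0], False: [0, 0]}  # [num_correct, num_total]
    for r in records:
        faithful = r["response_lang"] == r["prompt_lang"]
        buckets[faithful][0] += bool(r["correct"])
        buckets[faithful][1] += 1
    # None marks a bucket with no responses, to avoid dividing by zero.
    return {k: (c / n if n else None) for k, (c, n) in buckets.items()}


# Toy records, invented purely for illustration.
records = [
    {"prompt_lang": "de", "response_lang": "de", "correct": True},
    {"prompt_lang": "de", "response_lang": "en", "correct": False},
    {"prompt_lang": "sw", "response_lang": "en", "correct": False},
    {"prompt_lang": "sw", "response_lang": "sw", "correct": True},
    {"prompt_lang": "sw", "response_lang": "en", "correct": True},
    {"prompt_lang": "sw", "response_lang": "en", "correct": False},
]
per_language = summarize(records)
split = accuracy_by_faithfulness(records)
```

Comparing the two entries of `split` is one simple way to check, for a given model, whether accuracy is higher when responses are faithful.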