Large language models (LLMs) have demonstrated multilingual capabilities, yet they remain mostly English-centric due to imbalanced training corpora. Existing work leverages this English-centricity to improve multilingual performance through translation into English, primarily on natural language processing (NLP) tasks. This work extends the evaluation from NLP tasks to real user queries and from English-centric LLMs to non-English-centric ones. While translation into English can improve the performance of English-centric LLMs on multilingual NLP tasks, it may not be optimal in all scenarios. For culture-related tasks that require deep language understanding, prompting in the native language tends to be more promising, as it better captures the nuances of culture and language. Our experiments reveal varied behaviors across LLMs and tasks in the multilingual setting. We therefore advocate for more comprehensive multilingual evaluation and for greater effort toward developing multilingual LLMs beyond English-centric ones.