Large Language Models (LLMs) demonstrate strong machine translation capabilities for languages they are trained on. However, the impact of factors beyond training data size on translation performance remains debated, especially for languages not directly encountered during training. Our study examines Llama2's translation capabilities. By modeling a linear relationship between linguistic feature distances and machine translation scores, we ask whether there are better central languages for LLMs than English. Our experiments show that the 7B Llama2 model achieves BLEU scores above 10 when translating into every language it has seen, a threshold it rarely reaches for languages it has not seen. Most translation improvements into unseen languages come from scaling up model size rather than from instruction tuning or increasing the shot count. Furthermore, our correlation analysis reveals that syntactic similarity is not the only linguistic factor strongly correlated with machine translation scores. Interestingly, we find that under specific circumstances some languages (e.g., Swedish, Catalan), despite having significantly less training data, exhibit correlation levels comparable to English. These insights challenge the prevailing landscape of LLMs, suggesting that models centered on languages other than English could provide a more efficient foundation for multilingual applications.