What Drives Performance in Multilingual Language Models?

This study investigates the factors that influence the performance of multilingual large language models (MLLMs) across diverse languages. We evaluate six MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on SIB-200, a topic classification dataset covering 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. For UNSEEN languages, by contrast, script type and language family are the crucial factors, which highlights the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter which features are identified as most important. Our findings provide insight into the strengths and limitations of current MLLMs, and we hope they will guide the development of more effective and equitable multilingual NLP systems.
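As a rough illustration of how a decision tree can rank candidate factors, the sketch below scores each feature by the largest variance reduction a single split on it achieves, which is the criterion a regression tree uses to pick its root split. It is a simplified stand-in and not the study's actual pipeline: the feature names, the helper functions, and every number in `TOY_ROWS` are hypothetical and were invented for this example.

```python
from statistics import pvariance

# Hypothetical toy records, invented for illustration only.
# They are NOT results from the study.
TOY_ROWS = [
    {"log_pretrain_size": 9.5, "script": "Latin", "family": "Indo-European", "score": 0.90},
    {"log_pretrain_size": 9.0, "script": "Cyrillic", "family": "Indo-European", "score": 0.88},
    {"log_pretrain_size": 8.5, "script": "Latin", "family": "Niger-Congo", "score": 0.85},
    {"log_pretrain_size": 8.0, "script": "Arabic", "family": "Afro-Asiatic", "score": 0.84},
    {"log_pretrain_size": 6.0, "script": "Latin", "family": "Indo-European", "score": 0.60},
    {"log_pretrain_size": 5.5, "script": "Cyrillic", "family": "Indo-European", "score": 0.58},
    {"log_pretrain_size": 5.0, "script": "Latin", "family": "Niger-Congo", "score": 0.55},
    {"log_pretrain_size": 4.5, "script": "Arabic", "family": "Afro-Asiatic", "score": 0.52},
]


def split_gain(parent, left, right):
    """Variance reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
    return pvariance(parent) - weighted


def best_gain(rows, feature, target="score"):
    """Largest variance reduction achievable with one split on `feature`."""
    scores = [r[target] for r in rows]
    best = 0.0
    for v in sorted({r[feature] for r in rows}):
        if isinstance(v, (int, float)):
            # Numeric feature: threshold split.
            goes_left = lambda r: r[feature] <= v
        else:
            # Categorical feature: one-vs-rest split.
            goes_left = lambda r: r[feature] == v
        left = [r[target] for r in rows if goes_left(r)]
        right = [r[target] for r in rows if not goes_left(r)]
        if left and right:  # skip degenerate splits with an empty side
            best = max(best, split_gain(scores, left, right))
    return best


def rank_features(rows, features):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {f: best_gain(rows, f) for f in features}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, gain in rank_features(TOY_ROWS, ["log_pretrain_size", "script", "family"]):
        print(f"{name}: {gain:.4f}")
```

On this toy data the pretraining-size feature ranks first because the scores were constructed to track it. A full decision tree would repeat the same split search recursively and aggregate the gains per feature across all nodes.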
