This study investigates the factors that influence the performance of multilingual large language models (MLLMs) across diverse languages. We evaluate six MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on SIB-200, a topic classification dataset covering 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (absent from, or not meaningfully documented in, the model's pretraining data). We examine how pretraining data size, general resource availability, language family, and script type affect model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. For UNSEEN languages, by contrast, script type and language family are crucial, which highlights the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter the most important features identified. We hope these findings offer insight into the strengths and limitations of current MLLMs and help guide the development of more effective and equitable multilingual NLP systems.
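The decision tree analysis mentioned above can be illustrated with a small sketch. The abstract does not specify the implementation, so what follows is only one plausible reading: a tiny regression tree that splits on variance reduction and reports each feature's share of the total reduction as its importance. The feature names, the integer coding of language family and script type, and all values in `X_TOY` and `Y_TOY` are hypothetical and made up for illustration; they are not data or results from the study.

```python
# Illustrative sketch only: toy data, NOT results from the study.
from collections import defaultdict

# Hypothetical feature names, one column per factor named in the abstract.
FEATURES = ["pretrain_data_size", "resource_level", "language_family", "script_type"]


def variance_sum(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X, y):
    """Return (gain, feature_index, threshold) for the best binary split, or None."""
    parent = variance_sum(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            gain = parent - variance_sum(left) - variance_sum(right)
            if best is None or gain > best[0]:
                best = (gain, j, thr)
    return best


def grow(X, y, depth, importance):
    """Recursively split the data, accumulating variance reduction per feature."""
    if depth == 0 or len(y) < 2:
        return
    split = best_split(X, y)
    if split is None or split[0] <= 1e-12:
        return
    gain, j, thr = split
    importance[j] += gain
    left = [(r, t) for r, t in zip(X, y) if r[j] <= thr]
    right = [(r, t) for r, t in zip(X, y) if r[j] > thr]
    for part in (left, right):
        grow([r for r, _ in part], [t for _, t in part], depth - 1, importance)


def feature_importance(X, y, max_depth=3):
    """Normalized importance: each feature's share of the total variance reduction."""
    importance = defaultdict(float)
    grow(X, y, max_depth, importance)
    total = sum(importance.values())
    return {
        name: (importance[j] / total if total else 0.0)
        for j, name in enumerate(FEATURES)
    }


# Toy rows: [log-scaled data size, resource level, family id, script id].
# The made-up score depends mostly on data size, with a small script effect.
X_TOY = [
    [1, 1, 0, 0], [2, 0, 1, 0], [3, 1, 2, 1], [4, 0, 0, 1],
    [5, 1, 1, 0], [6, 0, 2, 0], [7, 1, 0, 1], [8, 0, 1, 1],
]
Y_TOY = [0.10, 0.20, 0.31, 0.41, 0.50, 0.60, 0.71, 0.81]

scores = feature_importance(X_TOY, Y_TOY)
for name, share in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.3f}")
```

Treating language family and script type as integer codes with threshold splits is a simplification; a real analysis would more likely use one-hot encoding or a tree implementation with native categorical support, and would fit separate trees for the ALL, SEEN, and UNSEEN scenarios.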