The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, "help" from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.