Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Sámi. To address this issue, we present a novel three-stage continual training approach. We also experiment with combining causal and masked language modeling to obtain more flexible models. Based on our findings, we train, evaluate, and openly release a new large generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
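To make the combined objective concrete, below is a minimal PyTorch sketch of one way causal and masked language modeling can be mixed on a single Transformer stack. The toy architecture, the 90/10 objective split, and the 15% masking rate are illustrative assumptions for this sketch, not the actual NorMistral-11B training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy hyperparameters; all values here are illustrative assumptions.
VOCAB, MASK_ID = 1000, 0
D_MODEL, N_HEADS, N_LAYERS, SEQ_LEN = 128, 4, 2, 32

class TinyLM(nn.Module):
    """One Transformer stack that can run with or without a causal mask."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS, 4 * D_MODEL,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids, causal):
        mask = None
        if causal:  # standard autoregressive attention mask
            mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.encoder(self.embed(ids), mask=mask)
        return self.head(h)

def clm_loss(model, ids):
    """Causal LM: predict token t+1 from tokens up to t."""
    logits = model(ids[:, :-1], causal=True)
    return F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def mlm_loss(model, ids, mask_prob=0.15):
    """Masked LM: corrupt random positions, predict them bidirectionally."""
    corrupt = torch.rand_like(ids, dtype=torch.float) < mask_prob
    inputs = ids.masked_fill(corrupt, MASK_ID)
    logits = model(inputs, causal=False)
    # Compute loss only at the corrupted positions (a real pipeline would
    # also guard against batches where no position happens to be masked).
    targets = ids.masked_fill(~corrupt, -100)
    return F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1),
                           ignore_index=-100)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(100):
    batch = torch.randint(1, VOCAB, (8, SEQ_LEN))  # stand-in for real text
    # Sample one objective per step; the 0.9/0.1 mix is an assumption.
    loss = clm_loss(model, batch) if torch.rand(()) < 0.9 else mlm_loss(model, batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Sampling the objective per step means the same weights serve both autoregressive generation (with the causal mask) and bidirectional, infilling-style use (without it), which is one way the resulting model becomes more flexible.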