There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model's vocabulary. In this work, we address these issues by developing a pipeline for adapting English-oriented pre-trained models to other languages and by constructing efficient bilingual LLMs. Using this pipeline, we construct Vikhr, a series of bilingual open-source instruction-following LLMs designed specifically for the Russian language. ``Vikhr'' refers to the naming of the Mistral LLM series and means a ``strong gust of wind.'' Unlike previous Russian-language models, which typically rely on LoRA adapters on top of English-oriented models and sacrifice performance for lower training costs, Vikhr features an adapted tokenizer vocabulary and undergoes continued pre-training and instruction tuning of all weights. This not only enhances the model's performance but also significantly improves its computational and contextual efficiency. We also expand the instruction datasets and the corpora used for continued pre-training. The model weights, instruction sets, and code are publicly available.