Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We report on the model architecture design and pretraining, and on the results of evaluating the models' generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of NLP research directions and enable the development of industrial solutions for the Russian language.