Transformer language models (LMs) are now a fundamental component of NLP research methodologies and applications. However, the development of such models specifically for the Russian language has received little attention. This paper presents a collection of 13 Russian Transformer LMs spanning encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures in multiple sizes. The models are available on the HuggingFace platform. We report on the model architecture design and pretraining, and present the results of evaluating their generalization abilities on Russian natural language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we hope to broaden the scope of NLP research directions and enable the development of industrial solutions for the Russian language.