Pre-trained language models are trained on large-scale unlabeled data and can then be fine-tuned on small labeled datasets to achieve good results. Multilingual pre-trained language models are trained on corpora in multiple languages, so a single model can understand several languages at once. At present, research on pre-trained models focuses mainly on resource-rich languages, while there is relatively little research on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority language datasets and to verify the effectiveness of MiLMo, this paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models with the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.