Natural language processing tasks such as part-of-speech (POS) tagging, named entity recognition, machine translation, speech recognition, and language modeling (LM) are well studied for high-resource languages. For several low-resource languages, including Bodo, Mizo, and Nagamese, research on these tasks has either not yet begun or is at a nascent stage. Language models play a vital role in the downstream tasks of modern NLP, and they have been studied extensively for high-resource languages, yet languages such as Bodo, Rabha, and Mising still lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this is the first effort to develop a language model for Bodo. Second, we present an ensemble deep learning (DL) based POS tagging model for Bodo. The tagger combines a BiLSTM with a CRF and a stacked embedding of BodoBERT with BytePairEmbeddings. We evaluate several language models in the experiments to see how well they perform on the POS tagging task. The best-performing model achieves an F1 score of 0.8041. We also conduct a comparative experiment on Assamese POS taggers, since Assamese is spoken in the same region as Bodo.
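As a rough illustration of what a stacked embedding means here, the sketch below concatenates, for each token, the vectors produced by two embedding sources. It assumes the usual convention in sequence-labeling toolkits that stacking is per-token concatenation. The functions `toy_contextual` and `toy_subword` are hypothetical stand-ins for BodoBERT and byte-pair embeddings, and the tokens are placeholders, not Bodo data. In the model described above, the resulting feature vectors would feed the BiLSTM-CRF tagger, which is not shown.

```python
from typing import Callable, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def stack_embeddings(tokens: List[str], embedders: List[Embedder]) -> List[Vector]:
    """Concatenate, per token, the vectors produced by every embedder."""
    stacked: List[Vector] = []
    for token in tokens:
        vec: Vector = []
        for embed in embedders:
            vec.extend(embed(token))  # append this source's dimensions
        stacked.append(vec)
    return stacked


# Hypothetical toy embedder standing in for a contextual model (3 dims).
def toy_contextual(token: str) -> Vector:
    return [float(len(token)), 1.0, 0.0]


# Hypothetical toy embedder standing in for subword embeddings (2 dims).
def toy_subword(token: str) -> Vector:
    return [float(ord(token[0]) % 7), 0.5]


tokens = ["tok1", "token2"]  # placeholder tokens
features = stack_embeddings(tokens, [toy_contextual, toy_subword])
```

Each token ends up with a single vector whose length is the sum of the individual embedding sizes (here 3 + 2 = 5), so the downstream tagger sees both sources at once.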