Part-of-Speech Tagger for Bodo Language using Deep Learning approach

Language processing systems such as part-of-speech (POS) tagging, named entity recognition, machine translation, speech recognition, and language modeling (LM) are well studied for high-resource languages. For several low-resource languages, including Bodo, Mizo, Nagamese, and others, research on these systems has either not begun or is at a nascent stage. Language models play a vital role in the downstream tasks of modern NLP and have been studied extensively for high-resource languages, yet languages such as Bodo, Rabha, and Mising still lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this is the first effort to develop a language model for Bodo. Second, we present an ensemble deep learning (DL) based POS tagging model for Bodo. The model combines a BiLSTM with a CRF and a stacked embedding of BodoBERT with BytePairEmbeddings. We evaluate several language models in the experiments to see how well they perform on the POS tagging task. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, since Assamese is spoken in the same region as Bodo.
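As a rough illustration of the "stacked embedding" idea mentioned in the abstract, the sketch below assumes that stacking means concatenating, token by token, the vectors produced by each embedding source, so that the tagger receives one longer feature vector per token. This is not the authors' implementation: the function names, the toy lookup tables, and the placeholder tokens are all hypothetical stand-ins for BodoBERT and byte-pair vectors.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def make_lookup_embedder(table: Dict[str, Vector], dim: int) -> Embedder:
    """Return an embedder that looks a token up in `table`, or zeros if unseen."""
    def embed(token: str) -> Vector:
        return list(table.get(token, [0.0] * dim))
    return embed


def stack_embeddings(tokens: List[str], embedders: List[Embedder]) -> List[Vector]:
    """Concatenate the vectors from every embedder, token by token."""
    stacked: List[Vector] = []
    for token in tokens:
        vector: Vector = []
        for embed in embedders:
            vector.extend(embed(token))
        stacked.append(vector)
    return stacked


# Toy stand-ins with made-up values (hypothetical, not real model outputs).
contextual_like = make_lookup_embedder(
    {"tok_a": [0.1, 0.2, 0.3], "tok_b": [0.4, 0.5, 0.6]}, dim=3
)
subword_like = make_lookup_embedder({"tok_a": [1.0, 0.0]}, dim=2)

sentence = ["tok_a", "tok_b"]
features = stack_embeddings(sentence, [contextual_like, subword_like])

# Each token now has a 3 + 2 = 5 dimensional feature vector.
for token, vector in zip(sentence, features):
    print(token, vector)
```

In a tagger of the kind the abstract describes, vectors like these would be the per-token inputs to the BiLSTM, whose outputs a CRF layer would then decode into a tag sequence.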

