Part-of-speech tagging for Nagamese Language using CRF

This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. Nagamese, also known as Naga Pidgin, is an Assamese-lexified creole that developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work on part-of-speech tagging has been done for resource-rich languages such as English and Hindi. However, no such work has been done for Nagamese; to the best of our knowledge, this is the first attempt at part-of-speech tagging for the language. The aim of this work is to identify the part of speech of each word in a given Nagamese sentence. We create an annotated corpus of 16,112 tokens and apply a machine learning technique known as Conditional Random Fields (CRF). Using CRF, we achieve an overall tagging accuracy of 85.70%, with precision and recall of 86% and an F1-score of 85%.

Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.
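The abstract reports accuracy, precision, recall, and F1 but does not say how the per-tag scores were averaged. As a minimal sketch, assuming token-level scoring and support-weighted averaging (an assumption, not something the abstract states), the four numbers could be computed from gold and predicted tag sequences as follows. The function name `tagging_metrics` and the toy tags are illustrative only.

```python
from collections import Counter


def tagging_metrics(gold, pred):
    """Token-level accuracy plus support-weighted precision, recall and F1.

    Assumption: per-tag scores are averaged with weights proportional to
    how often each tag occurs in the gold data.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one token is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it did not belong
            fn[g] += 1  # missed an occurrence of gold tag g

    total = len(gold)
    support = Counter(gold)
    accuracy = sum(tp.values()) / total

    precision = recall = f1 = 0.0
    for tag, n in support.items():
        predicted = tp[tag] + fp[tag]
        p = tp[tag] / predicted if predicted else 0.0
        r = tp[tag] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example with made-up tags; not data from the paper.
gold = ["NOUN", "VERB", "NOUN", "DET"]
pred = ["NOUN", "NOUN", "NOUN", "DET"]
print(tagging_metrics(gold, pred))
```

Under this averaging scheme, weighted recall is equal to token-level accuracy by construction, because each tag's recall is weighted by its share of the gold tokens.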

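A part of speech is a basic grammatical property of a word, also commonly called its word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence and labeling it accordingly; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context [1].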
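The dependence on both the word itself and its context can be made concrete with a feature function of the kind commonly fed to sequence taggers such as CRFs. The sketch below is illustrative only: the feature names and the English example sentences are assumptions chosen for readability, not the feature set or data of any particular system.

```python
def token_features(tokens, i):
    """Describe token i by its own form and by its immediate neighbours."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        # Features of the word itself (its "definition").
        "word.lower": word.lower(),
        "prefix2": word[:2],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    # Features of the surrounding words (its "context").
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens):
    """Feature dictionaries for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# The same word form "can" receives different context features.
a = sentence_features(["They", "can", "fish"])[1]
b = sentence_features(["A", "can", "of", "fish"])[1]
print(a["prev.lower"], a["next.lower"])  # they fish
print(b["prev.lower"], b["next.lower"])  # a of
```

Because the word-level features of "can" are identical in both sentences, only the neighbouring-word features give a tagger a way to separate the two readings.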
