As a fundamental tool for natural language processing (NLP), a part-of-speech (POS) tagger assigns a POS label to each word in a sentence. This work proposes a novel lightweight POS tagger based on word embeddings, named GWPT (green word-embedding-based POS tagger). Following the green learning (GL) methodology, GWPT consists of three cascaded modules: 1) representation learning, 2) feature learning, and 3) decision learning. The main novelty of GWPT lies in representation learning: it takes non-contextual or contextual word embeddings, partitions the embedding dimension indices into low-, medium-, and high-frequency sets, and represents each set with a different N-gram. Experimental results show that GWPT achieves state-of-the-art accuracy with fewer model parameters and significantly lower computational complexity in both training and inference than deep-learning-based methods.
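The representation learning step can be illustrated with a minimal sketch. The abstract does not say how the frequency of an embedding dimension is measured or which N-gram is paired with which set, so everything below is an assumption made for illustration: the frequency proxy (mean absolute change of a dimension between adjacent words), the function names `partition_dims_by_frequency`, `ngram_features`, and `represent_word`, the set sizes, and the choice of wider context windows for lower-frequency dimensions are all hypothetical and are not taken from GWPT itself.

```python
import numpy as np

def partition_dims_by_frequency(sent_embs, n_low, n_high):
    """Split embedding dimension indices into low/medium/high-frequency sets.

    sent_embs: list of (T_i, D) arrays, one per sentence.
    ASSUMPTION: "frequency" of a dimension is approximated by its mean
    absolute change between adjacent words; the source does not define it.
    """
    D = sent_embs[0].shape[1]
    total = np.zeros(D)
    count = 0
    for E in sent_embs:
        if E.shape[0] < 2:
            continue  # a one-word sentence has no adjacent pairs
        total += np.abs(np.diff(E, axis=0)).sum(axis=0)
        count += E.shape[0] - 1
    score = total / max(count, 1)
    order = np.argsort(score)  # ascending: smoothest dimensions first
    low = np.sort(order[:n_low])
    mid = np.sort(order[n_low:D - n_high])
    high = np.sort(order[D - n_high:])
    return low, mid, high

def ngram_features(E, t, dims, half_window):
    """Concatenate the selected dims over positions t-half_window..t+half_window.

    Positions outside the sentence are zero-padded.
    """
    T = E.shape[0]
    parts = []
    for k in range(t - half_window, t + half_window + 1):
        if 0 <= k < T:
            parts.append(E[k, dims])
        else:
            parts.append(np.zeros(len(dims)))
    return np.concatenate(parts)

def represent_word(E, t, low, mid, high, windows=(2, 1, 0)):
    """Build one word's representation from the three dimension sets.

    ASSUMPTION: low-frequency dims get the widest context (5-gram),
    medium a 3-gram, high a 1-gram; the actual pairing may differ.
    """
    groups = zip((low, mid, high), windows)
    return np.concatenate([ngram_features(E, t, d, w) for d, w in groups])

# Toy sentence: 4 words, 3 dimensions (constant, slowly varying, alternating).
E = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.1, -1.0],
              [1.0, 0.2, 1.0],
              [1.0, 0.3, -1.0]])
low, mid, high = partition_dims_by_frequency([E], n_low=1, n_high=1)
vec = represent_word(E, 0, low, mid, high)
```

On the toy sentence, the constant dimension lands in the low-frequency set and the alternating one in the high-frequency set, and the first word is represented by a 9-dimensional vector (5 + 3 + 1). Under this sketch, such vectors would then be passed to the feature learning and decision learning modules, which are not shown here.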