Due to the scarcity of part-of-speech (POS) annotated data, existing studies on low-resource languages typically adopt unsupervised approaches to POS tagging. Among these, POS tag projection based on word alignment transfers POS tags from a high-resource source language to a low-resource target language via parallel corpora, making it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual POS tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from the high-resource language into the low-resource one, thereby constructing pseudo-parallel sentence pairs. We then train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique that calibrates the projected POS tags on the target side, enabling the training of a more effective tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish, and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese, and Turkish). Experimental results show that our method achieves performance comparable to a baseline cross-lingual POS tagger trained on genuine parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, the proposed multi-source projection technique yields an average improvement of 1.3% over previous methods.
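The two projection steps described above can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function names, the `(source_index, target_index)` alignment-pair format, and the majority-vote calibration rule are simplifying assumptions for illustration only.

```python
from collections import Counter

def project_tags(src_tags, alignment, tgt_len):
    """Project POS tags from one source sentence onto the target sentence
    via word-alignment links (src_index, tgt_index); target words with no
    aligned source word receive None."""
    tgt_tags = [None] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    return tgt_tags

def multi_source_vote(projections):
    """Calibrate the projected tags by combining projections from several
    source languages with a per-token majority vote (an assumed
    calibration rule); tokens with no projected tag at all stay None."""
    voted = []
    for i in range(len(projections[0])):
        votes = [p[i] for p in projections if p[i] is not None]
        voted.append(Counter(votes).most_common(1)[0][0] if votes else None)
    return voted
```

For example, projections of one target sentence from three hypothetical source languages can be merged so that a tag mis-projected from a single source is outvoted by the other two.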