Due to the scarcity of part-of-speech (POS) annotated data, existing studies on low-resource languages typically adopt unsupervised approaches to POS tagging. Among these, POS tag projection based on word alignment transfers POS tags from a high-resource source language to a low-resource target language via parallel corpora, making it particularly suitable for low-resource settings. However, this approach relies heavily on parallel corpora, which are unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual POS tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. The UNMT system first translates sentences from the high-resource language into the low-resource one, thereby constructing pseudo-parallel sentence pairs. We then train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique that calibrates the projected POS tags on the target side, enabling the training of a more effective tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish, and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese, and Turkish). Experimental results show that our method achieves performance comparable to a baseline cross-lingual POS tagger trained on genuine parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, the proposed multi-source projection technique yields an average improvement of 1.3% over previous methods.
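The two projection steps described above can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function names, the `(source_index, target_index)` alignment-pair format, and the majority-vote calibration rule are simplifying assumptions for illustration only.

```python
from collections import Counter

def project_tags(src_tags, alignment, tgt_len):
    """Project POS tags from one source sentence onto the target sentence
    via word-alignment links (src_index, tgt_index); target words with no
    aligned source word receive None."""
    tgt_tags = [None] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    return tgt_tags

def multi_source_vote(projections):
    """Calibrate the projected tags by combining projections from several
    source languages with a per-token majority vote (an assumed
    calibration rule); tokens with no projected tag at all stay None."""
    voted = []
    for i in range(len(projections[0])):
        votes = [p[i] for p in projections if p[i] is not None]
        voted.append(Counter(votes).most_common(1)[0][0] if votes else None)
    return voted
```

For example, projections of one target sentence from three hypothetical source languages can be merged so that a tag mis-projected from a single source is outvoted by the other two.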