Very low-resource languages, with only a few million tokens of data, are poorly supported by multilingual NLP approaches because their cross-lingual word representations are of poor quality. Recent work showed that good cross-lingual performance can be achieved when the source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language-chain-based approach that incorporates intermediate related languages to bridge the gap between a distant source and target. We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous work, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (<5M tokens) and 4 moderately low-resource (<50M tokens) target languages, and show improved performance in both categories. Additionally, our analysis reveals the importance of good-quality embeddings for intermediate languages, as well as the importance of leveraging anchor points from all languages in the multilingual space.
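The chain-based construction described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`add_language`, `build_chain`), the language labels, the vocabularies and the seed dictionaries are all hypothetical, and the anchored training step is replaced by a crude stand-in that copies the vector of a known translation and assigns random vectors to every other word. The sketch only shows the control flow: start from the source, add one language at a time, and let each new language draw anchor points from any language already in the shared space.

```python
import random
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Key = Tuple[str, str]        # (language, word)
Space = Dict[Key, Vector]    # shared multilingual space


def add_language(space: Space, lang: str, vocab: List[str],
                 seed_dict: Dict[str, Key], dim: int,
                 rng: random.Random) -> Space:
    """Add one language to the shared space.

    Toy stand-in for anchored training (an assumption, not the paper's
    method): a word with a seed-dictionary translation already in the
    space reuses that vector as its anchor; all other words get random
    placeholder vectors that real training would learn from text.
    """
    new_space = dict(space)
    for word in vocab:
        anchor: Optional[Key] = seed_dict.get(word)
        if anchor in space:
            # Anchor point: may come from ANY language added so far.
            new_space[(lang, word)] = list(space[anchor])
        else:
            new_space[(lang, word)] = [rng.uniform(-1, 1) for _ in range(dim)]
    return new_space


def build_chain(chain: List[str], vocabs: Dict[str, List[str]],
                seed_dicts: Dict[str, Dict[str, Key]],
                dim: int = 4, seed: int = 0) -> Space:
    """Build the space sequentially: source first, target last."""
    rng = random.Random(seed)
    source = chain[0]
    # Placeholder for embeddings trained on the resource-rich source.
    space: Space = {(source, w): [rng.uniform(-1, 1) for _ in range(dim)]
                    for w in vocabs[source]}
    for lang in chain[1:]:
        space = add_language(space, lang, vocabs[lang],
                             seed_dicts.get(lang, {}), dim, rng)
    return space


# Hypothetical three-language chain with made-up vocabulary items.
chain = ["src", "mid", "tgt"]
vocabs = {"src": ["s1", "s2"], "mid": ["m1", "m2"], "tgt": ["t1", "t2"]}
seed_dicts = {
    "mid": {"m1": ("src", "s1")},
    "tgt": {"t1": ("mid", "m1"), "t2": ("src", "s2")},
}
space = build_chain(chain, vocabs, seed_dicts)
```

In this toy run, `t1` is anchored through the intermediate language and `t2` directly through the source, which mirrors the abstract's point about leveraging anchor points from all languages in the multilingual space.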