Cross-lingual sentence embeddings have advanced significantly in recent years, but research on low-resource languages has lagged because parallel corpora are scarce. This paper shows that, in current models, cross-lingual word representations in low-resource languages are notably under-aligned with those in high-resource languages. To address this, we introduce a novel framework that explicitly aligns words between English and eight low-resource languages using off-the-shelf word alignment models. The framework incorporates three primary training objectives: aligned word prediction, word translation ranking, and the widely used translation ranking. Experiments on the bitext retrieval task demonstrate substantial improvements in sentence embeddings for low-resource languages. In addition, the model's competitive performance across a broader range of tasks in high-resource languages underscores its practicality.
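For readers unfamiliar with the translation ranking objective mentioned above, the following is a minimal sketch of its common in-batch form: each source sentence should score its own translation higher than every other target sentence in the batch. The abstract does not give the paper's exact formulation, so the dot-product scoring, the absence of a margin or temperature, and the function names `dot` and `translation_ranking_loss` are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def translation_ranking_loss(src, tgt):
    """Mean in-batch softmax cross-entropy (illustrative, not the paper's code).

    src[i] and tgt[i] are assumed to be embeddings of a translation pair;
    every other tgt[j] in the batch serves as a negative for src[i].
    """
    total = 0.0
    for i, s in enumerate(src):
        scores = [dot(s, t) for t in tgt]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(x - m) for x in scores))
        # Negative log-probability of the true translation.
        total += log_z - scores[i]
    return total / len(src)


# Toy 2-d embeddings with made-up values.
aligned = translation_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
swapped = translation_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each source vector is paired with its matching target (`aligned`) than when the pairs are swapped (`swapped`). A word-level variant would presumably apply the same ranking idea to aligned word pairs rather than sentence pairs, though the abstract does not specify the details.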