In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet's nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train $\overrightarrow{\mbox{ColexNet+}}$, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate $\overrightarrow{\mbox{ColexNet+}}$ on roundtrip translation, sentence retrieval and sentence classification and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.
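The two graph structures described above can be illustrated with a minimal sketch. It assumes colexification patterns have already been extracted as a mapping from (language, ngram) pairs to the concepts each ngram expresses. The toy entries, the function names `build_colexnet` and `build_colexnet_plus`, and the node encoding are hypothetical choices made for illustration. They are not the paper's data or implementation, which may add filtering or thresholds not shown here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: (language, ngram) -> concepts that ngram expresses.
# Entries are illustrative only and are not taken from the paper's corpus.
toy_patterns = {
    ("spa", "pueblo"): {"people", "village"},
    ("rus", "mir"): {"world", "peace"},
    ("deu", "volk"): {"people"},
}


def build_colexnet(patterns):
    """Concept graph: maps an edge (c1, c2) to the languages that colexify it."""
    edges = defaultdict(set)
    for (lang, _ngram), concepts in patterns.items():
        # Every pair of concepts sharing one ngram is a colexification.
        for c1, c2 in combinations(sorted(concepts), 2):
            edges[(c1, c2)].add(lang)
    return dict(edges)


def build_colexnet_plus(patterns):
    """Adjacency sets in which concepts are linked only via ngram nodes."""
    adj = defaultdict(set)
    for (lang, ngram), concepts in patterns.items():
        ngram_node = ("ngram", lang, ngram)
        for concept in concepts:
            concept_node = ("concept", concept)
            adj[concept_node].add(ngram_node)
            adj[ngram_node].add(concept_node)
    return dict(adj)


colexnet = build_colexnet(toy_patterns)
colexnet_plus = build_colexnet_plus(toy_patterns)
```

In this sketch, the number of languages attached to an edge of `colexnet` could serve as an edge weight, while `colexnet_plus` keeps the language-specific ngrams as explicit nodes, so that a graph-based embedding method could place ngrams and concepts in one shared space.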