Colexification in comparative linguistics refers to the phenomenon of a lexical form conveying two or more distinct meanings. In this paper, we propose simple and effective methods to build multilingual graphs from colexification patterns: ColexNet and ColexNet+. ColexNet's nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are in addition linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train high-quality multilingual embeddings $\overrightarrow{\mbox{ColexNet+}}$ that are well-suited for transfer learning scenarios. Existing work on colexification patterns relies on annotated word lists. This limits scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. In our experiments, we first show that ColexNet has a high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate $\overrightarrow{\mbox{ColexNet+}}$ on roundtrip translation, verse retrieval and verse classification and show that our embeddings surpass several baselines in a transfer learning setting. This demonstrates the benefits of colexification for multilingual NLP.
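The two graph structures described above can be illustrated with a minimal sketch. It is not the authors' implementation: the input dictionary `toy_realizations`, the function names, and the node-naming scheme are all hypothetical, and the step that extracts concept-to-ngram associations from the parallel corpus is assumed to have already happened. The sketch only shows how a concept graph (ColexNet-style) and a graph with intermediate ngram nodes (ColexNet+-style) could be derived from such associations.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: (language, ngram) -> concepts that ngram expresses.
# These entries are hand-written for illustration, not taken from the paper's data.
toy_realizations = {
    ("rus", "рук"): {"hand", "arm"},
    ("spa", "dedo"): {"finger", "toe"},
    ("xx1", "abc"): {"hand", "arm"},  # made-up language code and ngram
}


def build_colexnet(realizations):
    """Concept graph: maps a concept pair to the set of languages colexifying it."""
    edges = defaultdict(set)
    for (lang, _ngram), concepts in realizations.items():
        # Every pair of concepts sharing one form is a colexification.
        for c1, c2 in combinations(sorted(concepts), 2):
            edges[(c1, c2)].add(lang)
    return dict(edges)


def build_colexnet_plus(realizations):
    """Undirected adjacency in which concepts are linked via ngram nodes."""
    adjacency = defaultdict(set)
    for (lang, ngram), concepts in realizations.items():
        ngram_node = ("ngram", lang, ngram)
        for concept in concepts:
            concept_node = ("concept", concept)
            adjacency[concept_node].add(ngram_node)
            adjacency[ngram_node].add(concept_node)
    return dict(adjacency)


colexnet = build_colexnet(toy_realizations)
colexnet_plus = build_colexnet_plus(toy_realizations)

# Edge weight taken here as the number of languages sharing the pattern (an assumption).
for (c1, c2), langs in sorted(colexnet.items()):
    print(c1, c2, len(langs))
```

In this sketch, two concepts are neighbors in the first graph whenever at least one language expresses both with the same form, while in the second graph they are two hops apart through a language-specific ngram node. A node-embedding method could then be run on the second graph so that ngrams from different languages that realize the same concepts end up close together.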