Word embeddings that map words into a fixed-dimensional vector space are the backbone of modern NLP. Most word embedding methods encode semantic information, while phonetic information, which is important for some tasks, is often overlooked. In this work, we develop several novel methods that leverage articulatory features to build phonetically informed word embeddings, and we present a set of phonetic word embeddings to encourage their development, evaluation, and use by the community. While several methods for learning phonetic word embeddings already exist, their effectiveness is not evaluated consistently. Thus, we also propose several ways to evaluate both intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and extrinsic performance, such as rhyme and cognate detection and sound analogies. We hope that our suite of tasks will promote reproducibility and provide direction for future research on phonetic word embeddings.