Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme detection, cognate detection, and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.