Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme detection, cognate detection, and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.