Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling: these AWEs should capture not only the phonetics but also the meaning of a word, similar to textual word embeddings. We consider the scenario where only untranscribed speech is available in the target language. We introduce a number of strategies that leverage a pre-trained multilingual AWE model, that is, a phonetic AWE model trained on labelled data from multiple languages excluding the target. Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show, for the first time, that AWEs can be used for downstream semantic query-by-example search.
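To make the soft pseudo-word labelling step more concrete, the following is a minimal sketch rather than the method as implemented in the paper. It assumes that a segment's soft label is a softmax over negative squared Euclidean distances to the cluster centroids; the abstract does not specify the weighting. The function name `soft_pseudo_labels`, the `temperature` parameter, and the toy centroids are all hypothetical.

```python
import math


def soft_pseudo_labels(embedding, centroids, temperature=1.0):
    """Return a probability vector over clusters for one segment embedding.

    Assumption (not from the source): softness comes from a softmax over
    negative squared Euclidean distances to the cluster centroids.
    """
    sq_dists = [
        sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        for centroid in centroids
    ]
    # Shift by the smallest distance so the largest exponent is exp(0).
    min_dist = min(sq_dists)
    weights = [math.exp(-(d - min_dist) / temperature) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


# Toy data: three hypothetical 2-D centroids and one segment embedding.
# In practice the embeddings would come from the multilingual AWE model
# and the centroids from clustering those embeddings.
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
segment = [0.9, 0.1]
labels = soft_pseudo_labels(segment, centroids, temperature=0.5)
```

Under this assumption, each word segment is represented by a distribution over clusters rather than a single hard cluster identity. Such vectors could take the place of the one-hot word vectors that a textual Skipgram model uses, with a lower `temperature` pushing the labels closer to a hard assignment.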