How can we learn the mapping between written words and their spoken counterparts in the absence of explicit textual supervision? We present a visually grounded method for building a vocabulary of spoken words using only images and their spoken descriptions. First, we use image captioning systems to build a vocabulary of written words representing salient visual concepts in the images. For each word, we then find the utterances whose image captions contain that word. Finally, we align these utterances with an unsupervised word discovery technique to locate instances of the target word. The result is a set of spoken word segments linked to written words, obtained without any text supervision. In spoken word retrieval and keyword spotting experiments, the proposed approach outperforms a strong neural baseline while being more interpretable. These results demonstrate the feasibility of the approach in English and motivate future work on low-resource languages without transcripts.
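The utterance-selection step can be illustrated with a minimal sketch. It assumes the captioning system has already produced one text caption per image, keyed by the ID of the spoken description of that image. The function name `build_word_to_utterances`, the `min_utterances` threshold, and the stopword list are hypothetical choices for illustration, not details from the paper. The unsupervised alignment that locates the word within each utterance is not shown.

```python
from collections import defaultdict


def build_word_to_utterances(captions, min_utterances=2, stopwords=frozenset()):
    """Map each caption word to the utterance IDs whose image caption contains it.

    captions: dict of utterance ID -> automatically generated caption text
              (assumed output of an image captioning system).
    min_utterances: hypothetical threshold; a word needs at least this many
                    utterances to be worth aligning.
    """
    index = defaultdict(set)
    for utt_id, caption in captions.items():
        # Use a set so a word repeated within one caption counts once.
        for word in set(caption.lower().split()):
            if word not in stopwords:
                index[word].add(utt_id)
    # Keep only words shared by enough utterances to attempt alignment.
    return {
        word: sorted(utt_ids)
        for word, utt_ids in index.items()
        if len(utt_ids) >= min_utterances
    }


# Toy captions, invented for illustration only.
captions = {
    "utt1": "a dog runs on the beach",
    "utt2": "a brown dog plays with a ball",
    "utt3": "children play on the beach",
}
candidates = build_word_to_utterances(
    captions, stopwords=frozenset({"a", "on", "the", "with"})
)
print(candidates)
```

Each entry in `candidates` is a written word paired with the group of utterances that a word discovery method would then align to find the shared spoken segment.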