We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem either by using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied to multimodal few-shot learning in a real low-resource language, Yorùbá. Our approach uses the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yorùbá show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
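To make the retrieval task concrete, the following is a minimal sketch of scoring a spoken query against candidate images with an attention-weighted similarity and returning the best match. It is an illustration under stated assumptions, not the model described above: the embeddings are taken as given fixed-length vectors, each image is assumed to be represented by a list of region embeddings, and the names `attention_similarity` and `pick_image` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_similarity(word_vec: Vector, region_vecs: List[Vector]) -> float:
    """Score one word embedding against one image (hypothetical helper).

    Assumption: the image is a list of region embeddings. Each region is
    scored by its dot product with the word embedding, and the region
    scores are combined using softmax attention weights.
    """
    scores = [dot(word_vec, region) for region in region_vecs]
    weights = softmax(scores)
    return sum(w * s for w, s in zip(weights, scores))


def pick_image(word_vec: Vector, images: List[List[Vector]]) -> int:
    """Return the index of the image that best matches the spoken query."""
    sims = [attention_similarity(word_vec, regions) for regions in images]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy example with made-up 2-dimensional embeddings.
query = [1.0, 0.0]
candidates = [
    [[0.0, 1.0], [0.1, 0.9]],  # image 0: no region resembles the query
    [[0.9, 0.1], [0.0, 1.0]],  # image 1: one region closely matches
    [[0.2, 0.2]],              # image 2: weak match
]
best = pick_image(query, candidates)
```

In this toy setup the attention weights let a single well-matching region dominate the image score, which is one plausible reading of why a word-to-image attention mechanism suits images that contain several objects.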