We propose a visually grounded speech model that learns new words and their visual depictions from only a few word-image example pairs. Given a set of test images and a spoken query, the model must identify which image depicts the query word. Previous work simplified this problem either by using an artificial setting with digit word-image pairs or by using a large number of examples per class. Our approach works on natural word-image pairs while requiring fewer examples, i.e. fewer shots. It uses the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, a word-to-image attention mechanism determines word-image similarity. With this model, we achieve better performance with fewer shots than any existing approach.
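To make the retrieval task concrete, the following is a minimal sketch of how a word-to-image attention score could rank test images for a spoken query. It is not the model described here: the function names `attention_similarity` and `pick_image`, the use of scaled dot-product attention over image regions, and the cosine scoring are all assumptions chosen for illustration, and the embeddings are taken as given rather than produced by trained speech and vision encoders.

```python
import math


def attention_similarity(word_vec, region_feats):
    """Score one spoken-word embedding against one image's region features.

    Hypothetical illustration: attend over image regions with the word as
    the query, then compare the word with the attended image summary.
    """
    dim = len(word_vec)
    scale = math.sqrt(dim)

    # Scaled dot-product score between the word and each image region.
    scores = [
        sum(w * r for w, r in zip(word_vec, region)) / scale
        for region in region_feats
    ]

    # Numerically stable softmax over regions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted summary of the image, conditioned on the word.
    context = [
        sum(wt * region[d] for wt, region in zip(weights, region_feats))
        for d in range(dim)
    ]

    # Cosine similarity between the word and the attended summary.
    dot = sum(w * c for w, c in zip(word_vec, context))
    norm = math.sqrt(sum(w * w for w in word_vec)) * math.sqrt(
        sum(c * c for c in context)
    )
    return dot / (norm + 1e-8)


def pick_image(word_vec, images):
    """Return the index of the best-matching image and all similarity scores."""
    sims = [attention_similarity(word_vec, regions) for regions in images]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims
```

Calling `pick_image(query_embedding, [regions_of_image_0, regions_of_image_1, ...])` returns the index of the image whose attended regions best match the query. In a full system the same score could also be used to rank unlabelled images against the few given examples when mining additional training pairs, though that use is likewise an assumption of this sketch.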