Multimodal few-shot learning is challenging because of the large domain gap between the vision and language modalities. Existing methods communicate visual concepts as prompts to frozen language models, but they rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already learned capacity. Because only the learnable parameters of the meta-mapper are updated, it learns to accrue meta-knowledge shared across these tasks and can therefore rapidly adapt to newly presented samples with only a few gradient updates. Importantly, it induces the task in a completely data-driven manner, with no need for hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions after observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings while being computationally more efficient.
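The abstract does not spell out the optimization procedure, so the following is only a minimal sketch of the general idea, assuming a first-order MAML-style loop (an assumption, not necessarily the authors' exact algorithm). A single scalar weight `w` stands in for the meta-mapper, two fixed constants stand in for the frozen vision and language models, and each toy regression task stands in for a multimodal few-shot task. All names here (`predict`, `adapt`, `meta_train`, `sample_task`) are hypothetical.

```python
import random

# Stand-ins for the frozen backbones: fixed constants that are never updated.
FROZEN_VISION_SCALE = 2.0   # hypothetical "vision encoder"
FROZEN_LM_SCALE = 0.5       # hypothetical "language model"


def predict(w, x):
    """Frozen vision features -> learnable mapper (w) -> frozen language model."""
    visual_feature = FROZEN_VISION_SCALE * x
    prompt = w * visual_feature          # the only learnable step
    return FROZEN_LM_SCALE * prompt


def loss_and_grad(w, batch):
    """Mean squared error over a batch and its gradient with respect to w only."""
    n = len(batch)
    loss, grad = 0.0, 0.0
    for x, y in batch:
        err = predict(w, x) - y
        loss += err * err / n
        grad += 2.0 * err * FROZEN_LM_SCALE * FROZEN_VISION_SCALE * x / n
    return loss, grad


def adapt(w, support, inner_lr, steps):
    """Inner loop: a few gradient updates on the support set of one task."""
    for _ in range(steps):
        _, grad = loss_and_grad(w, support)
        w = w - inner_lr * grad
    return w


def sample_task(rng, shots=4):
    """Toy task: y = slope * x, with a task-specific slope in [2, 4]."""
    slope = rng.uniform(2.0, 4.0)

    def draw(k):
        return [(x, slope * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(k))]

    return draw(shots), draw(shots)      # (support set, query set)


def meta_train(w, rng, meta_steps=200, tasks_per_step=4,
               inner_lr=0.1, outer_lr=0.05, inner_steps=2):
    """Outer loop: update the shared initialization w across sampled tasks."""
    for _ in range(meta_steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            support, query = sample_task(rng)
            w_task = adapt(w, support, inner_lr, inner_steps)
            # First-order approximation: use the query gradient at the adapted
            # weight and ignore second-order terms through the inner loop.
            _, grad = loss_and_grad(w_task, query)
            meta_grad += grad / tasks_per_step
        w = w - outer_lr * meta_grad
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    w_meta = meta_train(0.0, rng)
    support, query = sample_task(rng)
    loss_before, _ = loss_and_grad(w_meta, query)
    loss_after, _ = loss_and_grad(adapt(w_meta, support, 0.1, 5), query)
    print(f"meta-learned w = {w_meta:.3f}")
    print(f"query loss before adaptation = {loss_before:.4f}, after = {loss_after:.4f}")
```

The point of the sketch is the division of labor: the backbones stay fixed, the outer loop learns an initialization shared across tasks, and the inner loop adapts it to a new task from a handful of labeled examples. A realistic implementation would replace the scalar with a trainable network and the constants with pretrained encoders.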