Humans can learn novel compositional concepts by recalling and generalizing primitive concepts acquired from past experience. Inspired by this observation, we propose MetaReVision, a retrieval-enhanced meta-learning model for visually grounded compositional concept learning. MetaReVision consists of a retrieval module and a meta-learning module: the retrieved primitive concepts serve as a support set for meta-training vision-language models to recognize grounded compositional concepts. By meta-learning from episodes constructed by the retriever, MetaReVision learns a generic compositional representation that can be quickly updated to recognize novel compositional concepts. We create two benchmarks, CompCOCO and CompFlickr, to evaluate grounded compositional concept learning. Our experimental results show that MetaReVision outperforms competitive baselines and that the retrieval module plays an important role in the compositional learning process.
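To make the idea of retriever-built episodes concrete, the following is a minimal sketch of how a support set of primitive concepts might be assembled for a novel attribute-object composition. It is not the MetaReVision retriever: the `Example` type, the `build_episode` function, the toy feature vectors, and the choice of cosine similarity for ranking are all assumptions made for illustration, and the meta-learning update itself is not shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical record for one seen (attribute, object) instance."""
    attribute: str   # primitive concept, e.g. "red"
    obj: str         # primitive concept, e.g. "apple"
    feature: tuple   # toy stand-in for a learned embedding


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_episode(query, memory, k=2):
    """Assemble one episode: a retrieved support set plus the query.

    Assumption: a seen example is a candidate if it shares exactly one
    primitive (the attribute or the object, not both) with the query.
    Candidates are ranked by cosine similarity to the query feature.
    """
    candidates = [
        ex for ex in memory
        if (ex.attribute == query.attribute) != (ex.obj == query.obj)
    ]
    ranked = sorted(
        candidates,
        key=lambda ex: cosine(ex.feature, query.feature),
        reverse=True,
    )
    return {"support": ranked[:k], "query": query}


# Toy memory of seen compositions; "red apple" is treated as novel.
memory = [
    Example("red", "car", (1.0, 0.0, 0.2)),
    Example("green", "apple", (0.1, 1.0, 0.9)),
    Example("blue", "sky", (0.0, 0.2, 1.0)),
    Example("red", "rose", (0.9, 0.1, 0.1)),
]
query = Example("red", "apple", (0.8, 0.6, 0.5))
episode = build_episode(query, memory, k=2)

for ex in episode["support"]:
    print(ex.attribute, ex.obj)
```

In a meta-learning setup of the kind described above, each such episode would supply the support set for a fast adaptation step, with the query composition used to evaluate the adapted model.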