Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle with fine-grained entities that are rare in, or even absent from, the pre-training dataset. Hence, a key ingredient of their success has been the use of large-scale curated pre-training data aimed at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example, +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark.
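To make the retrieve-then-fuse idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: random vectors stand in for frozen CLIP embeddings, the memory is keyed by image embeddings and returns the paired text embeddings (one possible reading of "cross-modal" retrieval), the fusion block is a single-head pre-norm transformer layer with untrained random weights, and all names (`retrieve_cross_modal`, `fuse`, `init_params`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def retrieve_cross_modal(query, memory_keys, memory_values, k=4):
    """Return the k memory values whose keys have the highest cosine
    similarity to the query (assumed: image keys, paired text values)."""
    sims = l2_normalize(memory_keys) @ l2_normalize(query)  # shape (N,)
    top = np.argsort(-sims)[:k]
    return memory_values[top]  # shape (k, d)


def init_params(d, hidden, rng, scale=0.02):
    """Random, untrained weights for one transformer layer (illustrative only)."""
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, hidden), "W2": (hidden, d)}
    return {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}


def fuse(query, retrieved, params):
    """One single-head transformer layer over [query; retrieved] tokens.
    The output at the query position is the refined embedding."""
    tokens = np.vstack([query[None, :], retrieved])  # shape (1 + k, d)
    h = layer_norm(tokens)
    q, key, v = h @ params["Wq"], h @ params["Wk"], h @ params["Wv"]
    attn = softmax(q @ key.T / np.sqrt(q.shape[-1]))
    tokens = tokens + (attn @ v) @ params["Wo"]       # attention + residual
    h = layer_norm(tokens)
    tokens = tokens + np.maximum(h @ params["W1"], 0.0) @ params["W2"]  # MLP + residual
    return l2_normalize(tokens[0])


rng = np.random.default_rng(0)
d, n_mem = 16, 100  # toy sizes, not CLIP's real dimensions

# Stand-ins for frozen CLIP embeddings of the memory's image-text pairs.
memory_image_emb = l2_normalize(rng.normal(size=(n_mem, d)))
memory_text_emb = l2_normalize(rng.normal(size=(n_mem, d)))

# Stand-in for the frozen CLIP embedding of a query image.
image_emb = l2_normalize(rng.normal(size=d))

params = init_params(d, 4 * d, rng)
retrieved = retrieve_cross_modal(image_emb, memory_image_emb, memory_text_emb, k=4)
refined = fuse(image_emb, retrieved, params)
```

In this sketch only `params` would be trained (with a contrastive objective, per the abstract), while the encoders producing `image_emb` and the memory embeddings stay frozen; the refined embedding would then replace the original one in zero-shot classification.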