Recently, vision-language models such as CLIP have advanced the state of the art in a variety of multi-modal tasks, including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly because it usually involves computing gradients for large models. We propose a more efficient training protocol that fits a linear mapping between the image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and yields a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. We also propose two new learning-based image-captioning metrics that build on CLIP score and our linear mapping. Furthermore, we combine ReCap with our new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to other domains and that our DAL leads to a performance boost.
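To illustrate what fitting a linear mapping "via a closed-form solution" can look like, here is a minimal sketch. It assumes a ridge-regularized least-squares objective solved through the normal equations. The abstract does not specify the exact objective, so this is not necessarily the formulation ReCap uses. The function name `fit_linear_map`, the `ridge` parameter, and the random arrays standing in for CLIP image and text embeddings are all hypothetical.

```python
import numpy as np


def fit_linear_map(image_emb, text_emb, ridge=0.0):
    """Closed-form solution of min_W ||image_emb @ W - text_emb||_F^2 + ridge * ||W||_F^2.

    Assumption: a (ridge) least-squares objective; no gradients are computed.
    """
    d = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(d)
    # Solve the normal equations instead of forming an explicit inverse.
    return np.linalg.solve(gram, image_emb.T @ text_emb)


# Synthetic stand-ins for paired embeddings (hypothetical data, not CLIP outputs).
rng = np.random.default_rng(0)
n, d = 512, 16
image_emb = rng.normal(size=(n, d))
true_map = rng.normal(size=(d, d))
text_emb = image_emb @ true_map + 0.01 * rng.normal(size=(n, d))

W = fit_linear_map(image_emb, text_emb, ridge=1e-6)
rel_err = np.linalg.norm(W - true_map) / np.linalg.norm(true_map)
print(f"relative error of recovered mapping: {rel_err:.2e}")
```

Because the fit reduces to a single linear solve over the paired embeddings, its cost depends on the embedding dimension and the number of pairs rather than on the size of any language model, which is consistent with the efficiency argument made in the abstract.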