Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly as it usually involves calculating gradients for large models. We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics that build on CLIP score along with our linear mapping. Furthermore, we combine ReCap with our new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to other domains and that our DAL leads to a performance boost.
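As a rough illustration of what fitting a linear mapping "via a closed-form solution" can look like, the sketch below solves an ordinary (optionally ridge-regularized) least-squares problem with NumPy. The abstract does not say which closed-form solver ReCap uses, so the least-squares objective, the function name `fit_linear_mapping`, the `ridge` parameter, and the random arrays standing in for CLIP image and text embeddings are all assumptions made for illustration only.

```python
import numpy as np


def fit_linear_mapping(image_emb, text_emb, ridge=0.0):
    """Closed-form solution of min_W ||image_emb @ W - text_emb||_F^2 + ridge * ||W||_F^2.

    Hypothetical helper: least squares is an assumed choice of closed-form fit,
    not necessarily the one used by ReCap.
    """
    d = image_emb.shape[1]
    # Normal equations: (X^T X + ridge * I) W = X^T Y
    gram = image_emb.T @ image_emb + ridge * np.eye(d)
    return np.linalg.solve(gram, image_emb.T @ text_emb)


# Synthetic stand-ins for paired CLIP image/text embeddings (assumption).
rng = np.random.default_rng(0)
n, d = 512, 16
image_emb = rng.normal(size=(n, d))
true_W = rng.normal(size=(d, d))
text_emb = image_emb @ true_W + 0.01 * rng.normal(size=(n, d))

# No gradients or training loop: a single linear solve yields the mapping.
W = fit_linear_mapping(image_emb, text_emb)
mapped = image_emb @ W
rel_err = np.linalg.norm(mapped - text_emb) / np.linalg.norm(text_emb)
print(f"relative alignment error: {rel_err:.4f}")
```

Because the fit reduces to one linear solve over the embedding dimension, its cost does not depend on backpropagating through CLIP or a language model, which is consistent with the efficiency argument made above.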