Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions of given images. The CLIP model, with its rich semantic features learned from a large corpus of image-text pairs, is well-suited for this task. In this paper, we present a two-stage semi-supervised image captioning approach that exploits the potential of CLIP encoding. Our model comprises a CLIP visual encoder, a mapping network, and a language model for text generation. In the first stage, we train the model on a small labeled dataset by contrasting the generated captions with the ground-truth captions. In the second stage, we continue training on unlabeled images, aiming to maximize the image-caption similarity computed from CLIP embeddings. Remarkably, despite using less than 2% of the COCO-captions data, our approach achieves performance comparable to that of state-of-the-art models trained on the complete dataset. Furthermore, the captions generated by our approach are more distinctive, more informative, and better aligned with human preference.
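To make the second-stage objective concrete, the following is a minimal sketch of a loss that rewards image-caption similarity, assuming the similarity is a cosine similarity between embedding vectors. The function names `cosine_similarity` and `clip_similarity_loss` and the toy two-dimensional vectors are hypothetical stand-ins for illustration; they are not the paper's implementation, and the sketch does not show how the training signal is propagated back through text generation, which the abstract does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_similarity_loss(image_embs, caption_embs):
    """Negative mean cosine similarity over paired embeddings.

    Minimizing this value maximizes the average image-caption similarity.
    (Assumed form of the objective; hypothetical helper.)
    """
    sims = [cosine_similarity(i, c) for i, c in zip(image_embs, caption_embs)]
    return -sum(sims) / len(sims)


# Toy stand-ins for CLIP embeddings (real ones would come from the encoders).
image_embs = [[1.0, 0.0], [0.0, 2.0]]
aligned_caption_embs = [[2.0, 0.0], [0.0, 1.0]]      # same directions as the images
misaligned_caption_embs = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the images

loss_aligned = clip_similarity_loss(image_embs, aligned_caption_embs)
loss_misaligned = clip_similarity_loss(image_embs, misaligned_caption_embs)
print(loss_aligned, loss_misaligned)
```

Captions whose embeddings point in the same direction as their image embeddings yield a lower loss than captions whose embeddings are orthogonal to them, which is the behavior an objective of this kind is meant to encourage.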