CgT-GAN: CLIP-guided Text GAN for Image Captioning

The large-scale vision-language pre-trained model Contrastive Language-Image Pre-training (CLIP) has significantly improved image captioning in scenarios without human-annotated image-caption pairs. Recent CLIP-based captioning methods that avoid human annotations follow a text-only training paradigm, i.e., they reconstruct text from the shared embedding space. However, these approaches are limited by a gap between training and inference or by the large storage required for text embeddings. Because images are easy to obtain in the real world, we propose the CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process so that the model can "see" the real visual modality. Specifically, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus, and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded by two signals: the naturalness of the caption with respect to human language, computed by the GAN's discriminator, and the semantic guidance reward, computed by the CLIP-based reward module. In addition to using cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding obtained by attentively aggregating the entire corpus. Experimental results on three subtasks, zero-shot image captioning (ZS-IC), in-domain unsupervised image captioning (In-UIC), and cross-domain unsupervised image captioning (Cross-UIC), show that CgT-GAN significantly outperforms state-of-the-art methods across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
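
To make the two semantic rewards more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that image, caption, and corpus embeddings are already available as plain lists of floats (in practice they would come from CLIP's encoders). The abstract does not say what drives the attention in CLIP-agg; using the image embedding as the query is an assumption here, as are the `temperature` parameter, the additive combination in `joint_reward`, and its `alpha` weight. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def clip_cos_reward(image_emb, caption_emb):
    """CLIP-cos: cosine similarity between image and generated-caption embeddings."""
    return dot(normalize(image_emb), normalize(caption_emb))


def clip_agg_reward(image_emb, caption_emb, corpus_embs, temperature=0.1):
    """CLIP-agg sketch: align the caption with an attention-weighted corpus embedding.

    Assumption: the image embedding is the attention query over the corpus.
    """
    img = normalize(image_emb)
    corpus = [normalize(e) for e in corpus_embs]

    # Softmax attention over the whole corpus (max-subtracted for stability).
    logits = [dot(img, e) / temperature for e in corpus]
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    weights = [x / total for x in exps]

    # Weighted text embedding aggregated from all corpus sentences.
    dim = len(img)
    agg = [sum(w * e[i] for w, e in zip(weights, corpus)) for i in range(dim)]

    return dot(normalize(agg), normalize(caption_emb))


def joint_reward(naturalness, semantic, alpha=1.0):
    """Combine discriminator naturalness and semantic reward (assumed additive)."""
    return naturalness + alpha * semantic


if __name__ == "__main__":
    image = [1.0, 0.0]
    caption = [0.9, 0.1]
    corpus = [[1.0, 0.0], [0.0, 1.0]]
    semantic = clip_agg_reward(image, caption, corpus)
    print(joint_reward(naturalness=0.5, semantic=semantic))
```

In this sketch the generator would receive `joint_reward(...)` as its training signal, with the naturalness term supplied by the discriminator. The difference between the two rewards is the alignment target: CLIP-cos compares the caption to the image embedding directly, while CLIP-agg compares it to a text-side target built from the corpus.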
