Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by such a pre-trained prior. Nonetheless, defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLM-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
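The patch-wise mix-up step can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's exact formulation: it assumes CLIP patch features and concept text features live in a shared space, and uses a similarity threshold `tau` (an assumed hyperparameter) to decide how strongly a poorly aligned (likely defective) patch is blended toward the textual feature of its nearest salient concept.

```python
import numpy as np

def patch_wise_mixup(patch_feats, concept_feats, tau=0.3):
    """Hypothetical sketch of patch-wise cross-modal mix-up.

    Patches that align poorly with every detected salient concept
    (max cosine similarity below tau) are treated as defective and
    blended toward the textual feature of their nearest concept.

    patch_feats:   (P, D) patch-wise visual features (e.g., from CLIP)
    concept_feats: (K, D) CLIP text features of salient concepts
    """
    # L2-normalize both modalities for cosine similarity
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    c = concept_feats / np.linalg.norm(concept_feats, axis=1, keepdims=True)
    sim = p @ c.T                      # (P, K) patch-concept similarities
    best = sim.argmax(axis=1)          # nearest concept per patch
    conf = sim.max(axis=1)             # alignment confidence per patch
    # blend weight: 0 for well-aligned patches, up to 1 for defective ones
    alpha = np.clip((tau - conf) / tau, 0.0, 1.0)[:, None]
    mixed = (1.0 - alpha) * patch_feats + alpha * concept_feats[best]
    return mixed
```

A well-aligned patch passes through unchanged, while a patch with no supporting concept is replaced by its nearest concept's textual feature, which is the fine-grained, adaptive behavior the mechanism targets.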
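The CLIP-weighted cross-entropy loss can likewise be sketched in a few lines. This is a simplified illustration under assumptions not stated in the abstract: per-caption token-level cross-entropy is averaged, and the CLIP image-text similarities are normalized into batch weights with a softmax (the paper's exact weighting scheme may differ), so that high-quality synthetic pairs contribute more to the loss.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clip_weighted_ce(logits, targets, clip_scores):
    """Sketch of a CLIP-weighted cross-entropy loss (assumed form).

    logits:      (B, T, V) decoder scores over the vocabulary
    targets:     (B, T)    ground-truth token ids
    clip_scores: (B,)      CLIP image-text similarity per pair
    """
    B, T, _ = logits.shape
    probs = softmax(logits)
    # token-level negative log-likelihood of the target tokens
    tok_nll = -np.log(
        probs[np.arange(B)[:, None], np.arange(T)[None, :], targets] + 1e-12
    )
    # turn CLIP similarities into weights that favor high-quality pairs
    w = softmax(clip_scores)
    return float((w * tok_nll.mean(axis=1)).sum())
```

With equal CLIP scores this reduces to the ordinary mean cross-entropy, while a pair with a higher CLIP similarity is up-weighted relative to noisier synthetic pairs.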