Although image captioning models have advanced significantly in recent years, most of them depend heavily on high-quality datasets of paired images and texts, which are costly to acquire. Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings. However, a modality gap exists between CLIP text and image features, and a discrepancy also arises between training and inference because real-world images are unavailable; both hinder cross-modal alignment in text-only captioning. This paper proposes a novel method that addresses these issues by incorporating synthetic image-text pairs. A pre-trained text-to-image model generates images that correspond to the textual data, and the pseudo features of the generated images are optimized toward the real ones in the CLIP embedding space. Furthermore, textual information is gathered to represent image features, yielding image features with diverse semantics and a bridged modality gap. To unify training and inference, synthetic image features serve as the training prefix for the language decoder, while real images are used at inference. Additionally, salient objects in the images are detected to assist the learning of modality alignment. Experimental results demonstrate that our method achieves state-of-the-art performance on benchmark datasets.
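The abstract's step of gathering textual information to represent image features can be illustrated with a minimal, dependency-free sketch. It assumes, and this is an assumption rather than something the abstract specifies, that the gathering is a softmax-weighted sum over a bank of text features, with weights given by cosine similarity to the (synthetic) image feature. The names `gather_text_features`, `text_bank`, and `temperature` are hypothetical, and random vectors stand in for real CLIP embeddings.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gather_text_features(image_feat, text_bank, temperature=0.01):
    """Re-express an image feature as a softmax-weighted sum of text features.

    Assumption: the weighting scheme (cosine similarity + softmax) is an
    illustrative choice, not necessarily the one used in the paper.
    """
    image_feat = l2_normalize(image_feat)
    text_bank = [l2_normalize(t) for t in text_bank]

    # Cosine similarity between the image feature and every text feature.
    logits = [dot(image_feat, t) / temperature for t in text_bank]

    # Numerically stable softmax over the text bank.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of text features; the result lies in the span of the
    # text embeddings, which is the sense in which the gap is "bridged" here.
    dim = len(image_feat)
    gathered = [
        sum(w * t[i] for w, t in zip(weights, text_bank)) for i in range(dim)
    ]
    return l2_normalize(gathered), weights


# Toy data standing in for CLIP features (hypothetical sizes).
random.seed(0)
DIM, NUM_TEXTS = 16, 8
text_bank = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TEXTS)]

# A "synthetic image feature" close to text 3, as if produced from its caption.
synthetic_image_feat = [x + 0.1 * random.gauss(0, 1) for x in text_bank[3]]

prefix, weights = gather_text_features(synthetic_image_feat, text_bank)
```

In this sketch, `prefix` plays the role of the feature that would be fed to the language decoder as its prefix: computed from synthetic image features during training and from real image features at inference, so both stages pass through the same projection.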