Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

Training image captioning models with teacher forcing yields very generic samples, whereas more distinctive captions can be very useful in retrieval applications or for producing alternative texts that describe images for accessibility. Reinforcement Learning (RL) makes it possible to use the cross-modal retrieval similarity score between the generated caption and the input image as a reward to guide training, leading to more distinctive captions. Recent studies show that pre-trained cross-modal retrieval models can provide this reward, completely eliminating the need for reference captions. However, we argue in this paper that Ground Truth (GT) captions can still be useful in this RL framework. We propose a new training strategy for image captioning models that makes use of GT captions in three ways. Firstly, they can be used to train a simple MLP discriminator that serves as a regularizer, preventing reward hacking and ensuring the fluency of generated captions; this results in a textual GAN setup extended to multimodal inputs. Secondly, they can serve as additional trajectories in the RL strategy, resulting in a teacher forcing loss weighted by the similarity of the GT caption to the image. This objective acts as an additional learning signal grounded in the distribution of the GT captions. Thirdly, they can serve as strong baselines when added to the pool of captions used to compute the proposed contrastive reward, reducing the variance of the gradient estimate. Experiments on MS-COCO demonstrate the benefit of the proposed training strategy for producing highly distinctive captions while maintaining high writing quality.
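The second and third uses of GT captions can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: the abstract gives no formulas, so the log-softmax form of the contrastive reward, the temperature value, the two-caption pool, and every function name below (`cosine`, `contrastive_reward`, `loss_terms`) are assumptions made for illustration. Embeddings are plain Python lists standing in for the outputs of a pre-trained cross-modal retrieval model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_reward(image_emb, caption_pool, index, temperature=0.1):
    """Assumed reward: log-softmax of the image-caption similarity of
    caption `index` over every caption in the pool."""
    logits = [cosine(image_emb, c) / temperature for c in caption_pool]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[index] - log_z

def loss_terms(image_emb, sampled_emb, gt_emb, logp_sampled, logp_gt,
               temperature=0.1):
    """Return (policy_loss, weighted_tf_loss) for a single image.

    Assumption: the GT caption joins the pool and its reward is used as
    the baseline, so a sampled caption is only reinforced when it
    matches the image better than the GT caption does.
    """
    pool = [sampled_emb, gt_emb]
    r_sampled = contrastive_reward(image_emb, pool, 0, temperature)
    r_gt = contrastive_reward(image_emb, pool, 1, temperature)
    advantage = r_sampled - r_gt
    # REINFORCE-style term on the sampled caption's log-probability.
    policy_loss = -advantage * logp_sampled
    # Teacher forcing term weighted by the GT-image similarity.
    weighted_tf_loss = -cosine(image_emb, gt_emb) * logp_gt
    return policy_loss, weighted_tf_loss

# Toy usage with hand-written 2-D embeddings and log-probabilities.
policy_loss, tf_loss = loss_terms(
    image_emb=[1.0, 0.0],
    sampled_emb=[1.0, 0.0],
    gt_emb=[0.6, 0.8],
    logp_sampled=-1.0,
    logp_gt=-2.0,
)
```

In this toy setup the advantage is positive only because the sampled caption is closer to the image than the GT caption; a sampled caption that merely matches the GT caption would receive no reinforcement.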
