Existing multi-style image captioning methods show promising results in generating captions with accurate visual content and the desired linguistic style. However, they overlook the relationship between linguistic style and visual content. To overcome this drawback, we propose style-aware contrastive learning for multi-style image captioning. First, we present a style-aware visual encoder with contrastive learning to mine potential visual content relevant to style. Second, we propose a style-aware triplet contrast objective to distinguish whether an image, a style, and a caption match. To provide positive and negative samples for contrastive learning, we present three retrieval schemes (object-based, RoI-based, and triplet-based retrieval) and design a dynamic trade-off function to calculate retrieval scores. Experimental results demonstrate that our approach achieves state-of-the-art performance. In addition, we conduct an extensive analysis to verify the effectiveness of our method.