Generating visually grounded image captions in a specific linguistic style from unpaired stylistic corpora is challenging, especially when the stylized captions are expected to cover a wide variety of stylistic patterns. In this paper, we propose a novel framework for generating Accurate and Diverse Stylized Captions (ADS-Cap). ADS-Cap first uses a contrastive learning module to align image and text features, which unifies the paired factual and unpaired stylistic corpora during training. A conditional variational auto-encoder then automatically memorizes diverse stylistic patterns in latent space and enhances diversity through sampling. We also design a simple but effective recheck module that boosts style accuracy by filtering style-specific captions. Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance compared with various baselines in terms of consistency with the image, style accuracy, and diversity. Finally, we conduct extensive analyses to understand the effectiveness of our method. Our code is available at https://github.com/njucckevin/ADS-Cap.
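As a rough illustration of the sample-then-recheck idea described in the abstract, the sketch below draws several latent vectors from a standard normal prior, decodes each into a candidate caption, and keeps only the candidates that a style scorer accepts. It is not taken from the ADS-Cap code base. The function names (`sample_latents`, `recheck`, `generate_stylized`), the score threshold, the toy decoder, and the keyword-based scorer are all hypothetical stand-ins; in particular, the assumption that the recheck step is a threshold on a style score is ours, not a detail stated in the abstract.

```python
import random
from typing import Callable, List

Latent = List[float]


def sample_latents(num_samples: int, dim: int, rng: random.Random) -> List[Latent]:
    """Draw latent vectors from a standard normal prior (assumed prior)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_samples)]


def recheck(
    candidates: List[str],
    style_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only candidates whose style score reaches the threshold.

    Thresholding a score is an assumption made for this sketch.
    """
    return [c for c in candidates if style_score(c) >= threshold]


def generate_stylized(
    decode: Callable[[Latent], str],
    style_score: Callable[[str], float],
    num_samples: int = 8,
    dim: int = 4,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Sample latents, decode each into a caption, then filter by style."""
    rng = random.Random(seed)
    candidates = [decode(z) for z in sample_latents(num_samples, dim, rng)]
    return recheck(candidates, style_score, threshold)


# Hypothetical stand-ins for a trained decoder and style classifier.
STYLE_WORDS = {"lovely", "happily"}


def toy_decode(z: Latent) -> str:
    """Toy decoder: the first latent dimension decides the caption."""
    return "a lovely dog runs happily" if z[0] > 0 else "a dog runs"


def toy_style_score(caption: str) -> float:
    """Toy scorer: 1.0 if the caption contains any stylistic word, else 0.0."""
    return 1.0 if STYLE_WORDS & set(caption.split()) else 0.0


if __name__ == "__main__":
    for caption in generate_stylized(toy_decode, toy_style_score, num_samples=16):
        print(caption)
```

In a real system, `decode` would be the caption decoder conditioned on image features and the sampled latent, and `style_score` would be a trained style classifier; the toy versions here only make the control flow concrete.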