News Image Captioning aims to generate captions from news articles and their accompanying images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image region (a face), news image captions mostly contain contextual information that can only be found in the article. We design a retrieval strategy that uses CLIP to retrieve article sentences that are semantically close to the image, mimicking the human thought process of linking articles to images. Furthermore, to tackle the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method, Contrasting with Language Model backbone (CoLaM), into the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We outperform the previous state of the art (without external data) by 7.97/5.80 CIDEr points on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC.
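The retrieval step described above can be illustrated with a minimal sketch. It assumes the image and each article sentence have already been embedded into a shared space (for example, by CLIP's image and text encoders) and ranks sentences by cosine similarity to the image. The function names `cosine` and `retrieve_top_k`, the choice of `k`, and the toy vectors are hypothetical and are not taken from the released code; the actual retrieval strategy may differ in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(image_emb, sentence_embs, sentences, k=2):
    """Return the k sentences whose embeddings are closest to the image embedding.

    Assumption: embeddings come from a shared image-text space such as CLIP's.
    The selected sentences are returned in their original article order.
    """
    scored = [(cosine(image_emb, emb), idx) for idx, emb in enumerate(sentence_embs)]
    # Highest similarity first; ties broken by earlier position in the article.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    keep = sorted(idx for _, idx in scored[:k])
    return [sentences[idx] for idx in keep]


# Toy 2-D vectors standing in for real embeddings (illustrative values only).
image_emb = [1.0, 0.0]
sentences = ["a", "b", "c", "d"]
sentence_embs = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.5], [-1.0, 0.0]]
selected = retrieve_top_k(image_emb, sentence_embs, sentences, k=2)
```

Returning the selected sentences in article order, rather than in score order, is a design choice of this sketch: it keeps the retrieved context readable as a coherent excerpt when passed to a captioning model.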