CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation. However, existing models tend to generate "over-generic" descriptions: they produce repetitive sentences built from common concepts for different images. These generic descriptions fail to provide sufficient textual semantics for ever-changing web images. Inspired by the recent success of Vision-Language Pre-training (VLP) models, which learn diverse image-text concept alignments during pre-training, we explore leveraging their cross-modal pre-trained knowledge to automatically enrich the textual semantics of image descriptions. Without requiring additional human annotations, we propose a plug-and-play framework, CapEnrich, that complements generic image descriptions with more semantic details. Specifically, we first propose an automatic data-building strategy to obtain the desired training sentences, and then adopt prompting strategies, i.e., learnable prompts and template prompts, to encourage VLP models to generate more textual details. For learnable prompts, we freeze the whole VLP model and tune only the prompt vectors, which has two advantages: 1) the pre-trained knowledge of the VLP model is preserved as much as possible for describing diverse visual concepts; 2) only a lightweight set of trainable parameters is required, which makes the approach friendly to low-data settings. Extensive experiments show that our method significantly improves the descriptiveness and diversity of generated sentences for web images. The code is available at https://github.com/yaolinli/CapEnrich.
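
To make the learnable-prompt idea concrete, the following is a minimal toy sketch, not the CapEnrich implementation. It assumes a stand-in "frozen model" (mean pooling followed by a fixed linear head) in place of a real VLP model, and a squared-error objective in place of a captioning loss. All names (`frozen_score`, `tune_prompt`, `num_prompts`) are hypothetical and chosen for illustration only. It shows just the core mechanism described above: prompt vectors are prepended to the input and are the only values updated, while the model weights stay fixed.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def frozen_score(tokens, weights):
    """Toy stand-in for a frozen model: mean-pool token vectors, apply a fixed linear head."""
    dim = len(weights)
    n = len(tokens)
    pooled = [sum(tok[d] for tok in tokens) / n for d in range(dim)]
    return dot(pooled, weights)


def tune_prompt(input_tokens, weights, target, num_prompts=2, lr=0.5, steps=200, seed=0):
    """Fit only the prompt vectors by gradient descent; `weights` is never modified."""
    rng = random.Random(seed)
    dim = len(weights)
    # Hypothetical initialization: small random prompt vectors.
    prompts = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_prompts)]
    losses = []
    for _ in range(steps):
        tokens = prompts + input_tokens  # prompts are prepended to the input sequence
        error = frozen_score(tokens, weights) - target
        losses.append(error ** 2)
        # Gradient of the squared error w.r.t. each prompt entry:
        # d(loss)/d(prompt[d]) = 2 * error * weights[d] / len(tokens)
        scale = 2.0 * error / len(tokens)
        for p in prompts:
            for d in range(dim):
                p[d] -= lr * scale * weights[d]
    return prompts, losses


if __name__ == "__main__":
    weights = [0.5, -1.0, 2.0]  # frozen parameters of the toy model
    inputs = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [0.0, 1.0, 1.0]]
    prompts, losses = tune_prompt(inputs, weights, target=1.0)
    print("first loss:", losses[0], "last loss:", losses[-1])
    print("trainable values:", sum(len(p) for p in prompts))
```

In this sketch the trainable values are only the `num_prompts * dim` prompt entries, which mirrors the lightweight-parameter argument made above; a real setup would instead prepend prompt embeddings to the token embeddings of a frozen VLP model and optimize them with a captioning loss.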
