Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, and visual question answering (VQA). However, many of these methods rely on image-text pairs collected from the web as pre-training data and overlook the need for fine-grained feature alignment between the vision and language modalities, which requires a detailed understanding of images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer and image-location-caption triplets is challenging and time-consuming. Additionally, publicly available datasets for VQA and dense captioning are typically limited in scale because they depend on manual data collection and labeling. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Captions (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that, when used for pre-training in a multi-task manner, CC3M-QA-DC improves performance across various backbones and downstream tasks. Furthermore, CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) to achieve results competitive with models trained on much more data. Code and dataset are available at https://github.com/johncaged/OPT_Questioner.
