Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, and visual question answering (VQA). However, many of these methods rely on image-text pairs collected from the web as pre-training data and overlook the need for fine-grained feature alignment between the vision and language modalities, which requires a detailed understanding of both images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer and image-location-caption triplets is challenging and time-consuming. Additionally, publicly available datasets for VQA and dense captioning are typically limited in scale because they depend on manual data collection and labeling. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Captions (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that, when used for pre-training in a multi-task manner, CC3M-QA-DC improves performance across various backbones and downstream tasks. Furthermore, CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) to achieve results competitive with models trained on much more data. Code and dataset will be released.