We introduce SynGround, a novel framework that combines data-driven learning with knowledge transfer from several large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. Knowledge transfer begins with an image description generator, which produces image descriptions. These descriptions serve two purposes: they act as prompts for synthesizing images with a text-to-image generator, and as queries for synthesizing text, from which a large language model extracts phrases. Finally, we use an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on the resulting synthetic dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The finetuned model improves on the grounding capabilities of the off-the-shelf vision-and-language model. In particular, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, on RefCOCO+ Test A from 69.35% to 79.06%, and on RefCOCO+ Test B from 53.77% to 63.67%.
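The pointing game accuracy reported above is commonly computed by checking whether the peak of a model explanation heatmap falls inside the annotated region. The following is a minimal sketch of that metric, not the evaluation code used for SynGround. It assumes one ground-truth box per phrase, given as inclusive pixel coordinates `(x_min, y_min, x_max, y_max)`, and the function names `pointing_game_hit` and `pointing_game_accuracy` are hypothetical.

```python
import numpy as np


def pointing_game_hit(heatmap, box):
    """Return True if the heatmap's maximum lies inside the box.

    heatmap: 2D array of shape (H, W), e.g. a gradient-based explanation.
    box: (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive
         (an assumed convention for this sketch).
    """
    # argmax works on the flattened array; unravel_index maps it back to (row, col).
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x_min, y_min, x_max, y_max = box
    return bool(x_min <= col <= x_max and y_min <= row <= y_max)


def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of (heatmap, box) pairs for which the peak falls in the box."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes)]
    return sum(hits) / len(hits)


# Toy example: a 4x4 heatmap whose peak is at row 1, column 2.
toy = np.zeros((4, 4))
toy[1, 2] = 1.0
toy_accuracy = pointing_game_accuracy([toy, toy], [(2, 1, 3, 2), (0, 0, 1, 1)])
```

In the toy example the first box contains the peak and the second does not, so half of the pairs count as hits. Evaluation protocols differ in details such as how multiple boxes per phrase are handled, so the numbers above should not be expected to be reproducible from this sketch alone.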