Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks and identifies the objects in each block using the object detector widely used in VLP. It then reformulates the visual grounding task as a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling ``P'' or ``O'' in a PTP ``The block P has a O''. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistent and significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for the state-of-the-art BLIP \cite{blip} baseline. Moreover, PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
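To make the prompt construction concrete, the following is a minimal sketch of how position-guided text prompts could be generated from detector output. It is not the released implementation. It assumes (these choices are not stated in the abstract) that an object is assigned to the block containing the center of its bounding box, that blocks are numbered from 0 in row-major order, and that detections arrive as `(label, (x0, y0, x1, y1))` tuples in pixel coordinates. The function names `block_index` and `position_prompts` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def block_index(cx: float, cy: float, width: int, height: int, n: int = 3) -> int:
    """Return the row-major index (0 .. n*n - 1) of the grid block containing (cx, cy).

    Assumption: the image is split into an n x n grid of equal-sized blocks.
    """
    # Clamp so that points on the right/bottom edge fall into the last block.
    col = min(int(cx * n / width), n - 1)
    row = min(int(cy * n / height), n - 1)
    return row * n + col


def position_prompts(
    detections: List[Tuple[str, Box]], width: int, height: int, n: int = 3
) -> List[str]:
    """Build one "The block P has a O" prompt per detected object.

    Assumption: an object belongs to the block that contains its box center.
    """
    prompts = []
    for label, (x0, y0, x1, y1) in detections:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        p = block_index(cx, cy, width, height, n)
        prompts.append(f"The block {p} has a {label}.")
    return prompts


# Usage: a 300x300 image with two hypothetical detections and a 3x3 grid.
example = position_prompts(
    [("dog", (0, 0, 100, 100)), ("ball", (250, 250, 300, 300))],
    width=300,
    height=300,
    n=3,
)
```

During pre-training, either the block index or the object name in such a prompt would be masked and predicted, matching the fill-in-the-blank formulation; because the prompts are only needed to build training text, the detector is not required at inference time.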