Vision-Language Pre-training (VLP) methods based on object detection benefit from fine-grained object-text alignment, but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches avoid this cost, but they must process long visual sequences and lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment (PTA) pre-training task to learn a text-aware patch detector. Using off-the-shelf fine-grained object annotations for only 5\% of the training images, we train PTA jointly with conventional VLP objectives in an end-to-end manner. This bypasses the high computational cost of object detection and yields a patch detector that accurately identifies text-relevant patches, which considerably shortens the patch sequence and accelerates computation within the ViT backbone. Experiments on a variety of widely used benchmarks show that our method achieves a speedup of nearly 88\% over prior VLP models while maintaining competitive or superior performance on downstream tasks at similar model size and data scale.
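As a rough illustration of what converting object-level signals into patch-level ones could look like, the sketch below maps bounding boxes onto a ViT patch grid and marks each patch as relevant or not. The abstract does not specify the conversion rule, so everything here is an assumption made for illustration: the function name `boxes_to_patch_labels`, the 50% overlap threshold, and the idea that the boxes passed in belong to objects mentioned in the paired text.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def boxes_to_patch_labels(
    boxes: List[Box],
    image_size: int = 224,
    patch_size: int = 16,
    min_overlap: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[List[int]]:
    """Label a patch 1 if some box covers at least `min_overlap` of its area."""
    grid = image_size // patch_size
    patch_area = float(patch_size * patch_size)
    labels = [[0] * grid for _ in range(grid)]
    for row in range(grid):
        for col in range(grid):
            # Pixel extent of the patch at (row, col).
            px0, py0 = col * patch_size, row * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            for x0, y0, x1, y1 in boxes:
                # Width and height of the box/patch intersection.
                w = min(px1, x1) - max(px0, x0)
                h = min(py1, y1) - max(py0, y0)
                if w > 0 and h > 0 and (w * h) / patch_area >= min_overlap:
                    labels[row][col] = 1
                    break
    return labels


if __name__ == "__main__":
    # A 64x64 image with 16x16 patches gives a 4x4 grid.
    for row in boxes_to_patch_labels([(0, 0, 32, 32)], image_size=64, patch_size=16):
        print(row)
```

Under these assumptions, such binary patch labels could serve as supervision targets for a patch detector, and patches predicted as irrelevant could be dropped before the later ViT layers to shorten the visual sequence.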