Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods rely on high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, high-resolution images with accurate bounding box annotations are expensive to collect and to use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that uses only image-caption data yet achieves fine-grained region-level image understanding, eliminating the need for expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment between local image patches and text tokens to produce an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising coarse-grained downstream performance, often outperforming methods that use significantly more caption and box annotations.
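To make the patch-to-token alignment idea concrete, the following is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between a set of patch embeddings and a set of token embeddings. It is not VoLTA's implementation: the function name `sinkhorn_alignment`, the parameters `eps` and `n_iters`, the cosine cost, and the uniform marginals are all assumptions made for illustration. It also covers only a point-wise transport term, whereas graph optimal transport formulations typically add a structure-matching term that is omitted here.

```python
import numpy as np


def sinkhorn_alignment(patches, tokens, eps=0.5, n_iters=200):
    """Return a soft patch-to-token transport plan and its transport cost.

    Hypothetical helper for illustration only (not from the VoLTA code base).
    patches: (n, d) array of image patch embeddings.
    tokens:  (m, d) array of text token embeddings.
    """
    # Cosine distance between every patch and every token, in [0, 2].
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    cost = 1.0 - p @ t.T

    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # assumed uniform mass over patches
    b = np.full(m, 1.0 / m)  # assumed uniform mass over tokens

    # Gibbs kernel of the entropy-regularized problem.
    kernel = np.exp(-cost / eps)

    # Sinkhorn iterations: alternately rescale to match each marginal.
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)

    # The plan sums to 1, so its entries act as normalized matching scores.
    plan = u[:, None] * kernel * v[None, :]
    return plan, float((plan * cost).sum())
```

Because the plan's entries are non-negative and sum to one, each entry can be read as the share of alignment mass between one patch and one token, which is one way to interpret the "self-normalized" and "interpretable" properties described above. The scalar transport cost could serve as a weakly-supervised alignment loss, since it requires only paired image-caption embeddings and no box annotations.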