Vision-language pre-training (VLP) has recently proven highly effective for a range of uni-modal and multi-modal downstream applications. However, most existing end-to-end VLP methods rely on high-resolution image-text-box data to perform well on fine-grained region-level tasks such as object detection, segmentation, and referring expression comprehension. Such high-resolution images with accurate bounding-box annotations are expensive to collect and to use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that uses only image-caption data yet achieves fine-grained region-level image understanding, eliminating the need for expensive box annotations. VoLTA applies graph optimal transport-based weakly-supervised alignment to local image patches and text tokens, yielding an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising coarse-grained downstream performance, often outperforming methods that use significantly more caption and box annotations.
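To make the patch-to-token alignment idea more concrete, the following is a minimal sketch of an entropic optimal transport (Sinkhorn) alignment between local image-patch embeddings and text-token embeddings. It is not the authors' implementation. Graph optimal transport formulations typically combine a node-matching (Wasserstein) term with an edge-matching (Gromov-Wasserstein) term; this sketch covers only the node-matching term. The function names (`cosine_cost`, `sinkhorn_plan`, `alignment_distance`), the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_cost(patches, tokens):
    """Cost matrix C[i, j] = 1 - cosine similarity of patch i and token j."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return 1.0 - p @ t.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan (assumption: uniform marginals)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each image patch
    b = np.full(m, 1.0 / m)   # mass on each text token
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def alignment_distance(patches, tokens, eps=0.1, n_iters=200):
    """Return (transport cost, plan) for the patch-to-token alignment."""
    cost = cosine_cost(patches, tokens)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float((plan * cost).sum()), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(6, 16))  # hypothetical: 6 patches, 16-dim
    tokens = rng.normal(size=(4, 16))   # hypothetical: 4 tokens, 16-dim
    dist, plan = alignment_distance(patches, tokens)
    print("transport cost:", dist)
    print("plan shape:", plan.shape)
```

In this sketch the transport plan sums to one, which is one way to read "self-normalized", and each entry gives the mass moved between a specific patch and a specific token, which is what makes the matching explicit and inspectable. In a training setup, the transport cost would typically be added to the loss so that matched image-caption pairs are pulled toward low alignment cost.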