Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, existing encoder-only grounding frameworks (e.g., TransVG) suffer from heavy computation because self-attention has quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases. The key observation is that images exhibit high spatial redundancy. We therefore devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we reduce the GFLOPs of the multimodal transformer by ~44% while still achieving higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and to scale up Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.
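To make the alternating sample-then-decode loop described above more concrete, the following is a minimal pure-Python sketch and not the authors' implementation. It assumes a single query vector, bilinear interpolation for reading features at the sampled locations, and single-head dot-product attention without learned projections. The names `bilinear_sample`, `cross_attend`, `dynamic_decode`, and `toy_offsets` are hypothetical, as are the 20x20 map size, the 36 sampled points (chosen only because 36/400 equals 9%), the centre initialization of the reference point, and the rule that moves the reference point to the mean of the sampled locations.

```python
import math
import random

def bilinear_sample(feat, x, y):
    """Interpolate an H x W x C nested-list feature map at continuous (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp the location to the map
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    corners = zip(feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1])
    return [(1 - ay) * ((1 - ax) * a + ax * b) + ay * ((1 - ax) * c + ax * d)
            for a, b, c, d in corners]

def cross_attend(query, tokens):
    """Single-head dot-product attention; tokens act as both keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * tok[c] for e, tok in zip(exps, tokens))
            for c in range(len(query))]

def dynamic_decode(feat, text_tokens, predict_offsets, num_layers=3):
    """Alternate sparse sampling and text-guided decoding for num_layers rounds."""
    h, w, dim = len(feat), len(feat[0]), len(text_tokens[0])
    ref = ((w - 1) / 2.0, (h - 1) / 2.0)  # assumption: start at the map centre
    # Assumption: initialize the query as the mean of the text tokens.
    query = [sum(tok[c] for tok in text_tokens) / len(text_tokens)
             for c in range(dim)]
    for _ in range(num_layers):
        # Sampling step: offsets are predicted relative to the reference point.
        points = [(min(max(ref[0] + dx, 0.0), w - 1.0),
                   min(max(ref[1] + dy, 0.0), h - 1.0))
                  for dx, dy in predict_offsets(query)]
        sampled = [bilinear_sample(feat, x, y) for x, y in points]
        # Decoding step: attend over the sampled image features and the text.
        query = cross_attend(query, sampled + text_tokens)
        # Placeholder refinement: move the reference to the mean sampled point.
        ref = (sum(x for x, _ in points) / len(points),
               sum(y for _, y in points) / len(points))
    return query, ref

def toy_offsets(query):
    """Hypothetical stand-in for a learned offset head: a shifted 6 x 6 grid."""
    shift = math.tanh(query[0])
    return [(-5 + 2 * i + shift, -5 + 2 * j + shift)
            for i in range(6) for j in range(6)]

rng = random.Random(0)
feat = [[[rng.uniform(-1, 1) for _ in range(8)] for _ in range(20)]
        for _ in range(20)]
text = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
query, ref = dynamic_decode(feat, text, toy_offsets)
print("refined reference point:", ref)
```

In this toy setup each decoding layer attends over 36 sampled points plus 5 text tokens instead of all 400 feature locations, which is where the computational saving in such a design would come from. The GFLOPs figure reported in the abstract refers to the actual model and cannot be derived from this sketch.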