CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Recent vision-language models have progressed far beyond expectations. However, their computational costs are growing just as quickly, especially for large models, which makes model acceleration critical when resources are limited. Although acceleration has been extensively studied for unimodal models, it remains relatively under-explored for multimodal models, especially vision-language Transformers. To pursue more efficient and accessible vision-language Transformers, this paper introduces \textbf{Cross}-\textbf{G}uided \textbf{E}nsemble of \textbf{T}okens (\textbf{\emph{CrossGET}}), a universal acceleration framework for vision-language Transformers. The framework adaptively combines tokens through real-time, cross-modal guidance, thereby achieving substantial acceleration while maintaining high performance. \textit{CrossGET} has two key innovations: 1) \textit{Cross-Guided Matching and Ensemble}. \textit{CrossGET} incorporates cross-modal guided token matching and ensemble to exploit cross-modal information effectively, introducing only cross-modal tokens, which add negligible extra parameters. 2) \textit{Complete-Graph Soft Matching}. In contrast to the existing bipartite soft matching approach, \textit{CrossGET} introduces a complete-graph soft matching policy that achieves more reliable token-matching results while remaining parallelizable and highly efficient. Extensive experiments are conducted on various vision-language tasks, including image-text retrieval, visual reasoning, image captioning, and visual question answering. Performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the effectiveness and versatility of the proposed \textit{CrossGET} framework. The code will be available at \url{https://github.com/sdc17/CrossGET}.
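
The abstract does not spell out the matching procedure, so the following is only a toy sketch of what matching over a complete graph (every token compared with every other token, rather than two fixed partitions) might look like. The function names `cosine` and `merge_tokens`, the use of cosine similarity, and the plain averaging of merged tokens are all assumptions made for illustration. The sketch also omits the cross-modal guidance entirely and is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_tokens(tokens, r):
    """Reduce the token count by r (hypothetical helper, illustration only).

    Assumed procedure: score every pair of tokens, pick the r tokens whose
    best match is strongest as "sources", and average each source into its
    most similar remaining token.
    """
    n = len(tokens)
    r = max(0, min(r, n - 1))  # always keep at least one token
    if r == 0:
        return [list(t) for t in tokens]

    # Pairwise similarity over the complete graph; self-matches are excluded.
    sim = [
        [cosine(tokens[i], tokens[j]) if i != j else -math.inf for j in range(n)]
        for i in range(n)
    ]

    # Rank tokens by their strongest similarity to any other token.
    best = [max(row) for row in sim]
    order = sorted(range(n), key=lambda i: best[i], reverse=True)
    sources = order[:r]
    source_set = set(sources)
    dests = [i for i in range(n) if i not in source_set]

    # Assign each source to its most similar destination token.
    groups = {d: [d] for d in dests}
    for s in sources:
        d = max(dests, key=lambda j: sim[s][j])
        groups[d].append(s)

    # Replace each group with the element-wise mean of its members.
    merged = []
    for d in dests:
        members = groups[d]
        dim = len(tokens[d])
        merged.append(
            [sum(tokens[m][k] for m in members) / len(members) for k in range(dim)]
        )
    return merged


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(merge_tokens(toy, r=1))
```

In this toy input the first two tokens point in nearly the same direction, so they are the pair that gets averaged, while the third token passes through unchanged. The per-token steps are written as Python loops for readability; a practical version would express them as batched tensor operations to preserve the parallelism the abstract emphasizes.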
