In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has remained relatively underexplored. In this study, we propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computational cost. We first reveal the existence of visual anchors in the Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining to guide the aggregation of information. Through extensive experiments, we demonstrate that the proposed method reduces computational cost by nearly two-thirds compared with the baseline, while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer. Code is available at https://github.com/liuhaogeng/Anchor-Former.