Vision-language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks. As pre-trained VLMs grow rapidly in size, parameter-efficient transfer learning (PETL) has gained attention as a viable alternative to full fine-tuning. One such approach is the adapter, which introduces a small number of trainable parameters into the pre-trained model while keeping the original parameters unchanged during adaptation. In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features serve as the node features and the attention matrix serves as the graph adjacency matrix. Within this framework, tuning adapters in VLMs requires handling heterophilic graphs, owing to the disparity between the projected query and value spaces. To address this challenge, we propose a new adapter architecture, $p$-adapter, which employs the $p$-Laplacian message passing used in Graph Neural Networks (GNNs). Specifically, the attention weights are re-normalized based on the features, and the features are then aggregated using the calibrated attention matrix, enabling the dynamic exploitation of information with varying frequencies in the heterophilic attention graphs. We conduct extensive experiments on different pre-trained VLMs and multi-modal tasks, including visual question answering, visual entailment, and image captioning. The experimental results show that our method significantly outperforms other PETL methods.
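To make the two steps named in the abstract concrete (re-normalizing attention weights based on the features, then aggregating with the calibrated matrix), the following is a minimal sketch and not the $p$-adapter implementation. The abstract does not state the re-normalization formula, so the weighting rule used here is an assumption chosen for illustration: each attention weight is scaled by the feature gap between its query node and value node raised to the power $(p-2)/2$, and rows are then re-normalized. All function names, the `eps` constant, and the toy features are hypothetical.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax: each row becomes a distribution over value nodes."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def recalibrate_attention(attn, q_feats, v_feats, p=2.0, eps=1e-6):
    """Re-weight attn[i][j] by the feature gap between query i and value j.

    Assumed rule (illustrative only, not taken from the paper):
        weight_ij = attn_ij * (||q_i - v_j||^2 + eps) ** ((p - 2) / 2)
    followed by row normalization. With p = 2 the attention is unchanged.
    """
    calibrated = []
    for i, row in enumerate(attn):
        weights = []
        for j, a in enumerate(row):
            sq_dist = sum((qi - vj) ** 2 for qi, vj in zip(q_feats[i], v_feats[j]))
            weights.append(a * (sq_dist + eps) ** ((p - 2.0) / 2.0))
        total = sum(weights)
        calibrated.append([w / total for w in weights])
    return calibrated


def aggregate(attn, v_feats):
    """Message passing step: each query node takes a weighted sum of value features."""
    dim = len(v_feats[0])
    return [[sum(a * v[k] for a, v in zip(row, v_feats)) for k in range(dim)]
            for row in attn]


# Toy attention graph: 2 query nodes, 3 value nodes, 2-dimensional features.
q_feats = [[1.0, 0.0], [0.0, 1.0]]
v_feats = [[1.0, 0.1], [-1.0, 0.0], [0.0, 2.0]]
scores = [[sum(a * b for a, b in zip(q, v)) for v in v_feats] for q in q_feats]
attn = softmax_rows(scores)  # plays the role of the graph adjacency matrix

out_p2 = aggregate(recalibrate_attention(attn, q_feats, v_feats, p=2.0), v_feats)
out_p15 = aggregate(recalibrate_attention(attn, q_feats, v_feats, p=1.5), v_feats)
```

In this sketch, $p = 2$ recovers ordinary attention aggregation, $p < 2$ shifts weight toward value nodes whose features are close to the query node's, and $p > 2$ shifts weight toward dissimilar ones. This is one simple way a single parameter can change which neighbours dominate on a heterophilic graph; the actual normalization in the proposed method may differ.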