In this paper, we propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant for downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. In addition, we propose a novel propagation-of-information adapter (PIA) that supports the attention skipping of EAS while keeping parameter efficiency, and that can be further re-parameterized into the feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS obtains 89.98\% accuracy on ScienceQA while running inference 2.2 times faster than LaVIN.
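To illustrate why re-parameterizing an adapter into an FFN can remove its inference cost, the following is a minimal sketch and not the authors' implementation. It assumes a purely linear, residual, low-rank adapter placed directly before the first linear layer of an FFN; the actual PIA design is not specified in this abstract. The names `merge_adapter_into_linear`, `w_down`, `w_up`, `w1`, and `b1` are hypothetical and chosen only for this example.

```python
import numpy as np


def merge_adapter_into_linear(w1, b1, w_down, w_up):
    """Fold a linear residual adapter into the linear layer that follows it.

    Assumed (hypothetical) adapter: x -> x + x @ w_down @ w_up
    Following layer:                x -> x @ w1 + b1
    Because both maps are linear, their composition is a single linear layer
    with weight (I + w_down @ w_up) @ w1 and an unchanged bias.
    """
    d_model = w1.shape[0]
    merged_w = (np.eye(d_model) + w_down @ w_up) @ w1
    return merged_w, b1


# Small random example; all sizes are arbitrary illustration values.
rng = np.random.default_rng(0)
d_model, d_hidden, rank = 8, 32, 2
x = rng.normal(size=(4, d_model))
w1 = rng.normal(size=(d_model, d_hidden))
b1 = rng.normal(size=d_hidden)
w_down = rng.normal(size=(d_model, rank))
w_up = rng.normal(size=(rank, d_model))

# Training-time view: adapter first, then the FFN's first linear layer.
two_step = (x + x @ w_down @ w_up) @ w1 + b1

# Inference-time view: one merged linear layer, no separate adapter call.
w1_merged, b1_merged = merge_adapter_into_linear(w1, b1, w_down, w_up)
one_step = x @ w1_merged + b1_merged

# The two views should agree up to floating-point rounding.
max_abs_diff = float(np.max(np.abs(two_step - one_step)))
```

Under these assumptions the merged layer has the same shape as the original one, so the adapter adds no parameters or latency at inference time. An adapter containing a nonlinearity between its projections could not be folded in this way.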