In this paper, we propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant for downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. In addition, we propose a novel propagation-of-information adapter (PIA) to support the attention skipping of EAS and maintain parameter efficiency; PIA can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic vision-language (VL) pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS can obtain 89.98% accuracy on ScienceQA while speeding up inference by 2.2 times compared with LaVIN.
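
The claim that an adapter can be re-parameterized into the FFN for zero extra latency can be illustrated with a small numerical sketch. The abstract does not specify the internal structure of PIA, so the following assumes, purely for illustration, a residual linear bottleneck adapter (no non-linearity, no bias) placed directly before the first FFN projection. The names `down`, `up`, `w1`, and `fold_adapter` are hypothetical and not taken from the paper. Under that assumption, the adapter and the projection collapse into a single merged weight matrix, since (x + x·down·up)·W1 = x·(I + down·up)·W1.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def fold_adapter(down, up, w1):
    """Return W1' = (I + down @ up) @ W1 (assumed linear residual adapter).

    With this merged weight, x @ W1' equals (x + x @ down @ up) @ W1,
    so the adapter adds no separate computation at inference time.
    """
    d_model = len(down)
    return matmul(add(identity(d_model), matmul(down, up)), w1)


rng = random.Random(0)
d_model, d_bottleneck, d_hidden = 4, 2, 8  # illustrative sizes only

down = rand_matrix(d_model, d_bottleneck, rng)  # hypothetical adapter down-projection
up = rand_matrix(d_bottleneck, d_model, rng)    # hypothetical adapter up-projection
w1 = rand_matrix(d_model, d_hidden, rng)        # first FFN projection
x = rand_matrix(3, d_model, rng)                # three token embeddings

# Training-time path: residual adapter followed by the FFN projection.
adapted = add(x, matmul(matmul(x, down), up))
two_step = matmul(adapted, w1)

# Inference-time path: a single projection with the merged weight.
w1_merged = fold_adapter(down, up, w1)
one_step = matmul(x, w1_merged)

# Largest absolute difference between the two paths (floating-point noise only).
max_gap = max(abs(p - q) for rp, rq in zip(two_step, one_step) for p, q in zip(rp, rq))
print(max_gap < 1e-9)
```

The merged matrix has the same shape as the original `w1`, which is why the folded model would carry no extra parameters or latency at inference. This sketch covers only the re-parameterization idea; it does not model how EAS scores attention redundancy or selects which MHAs to skip.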