Multimodal large language models (MLLMs) have demonstrated impressive performance on various vision-language (VL) tasks, but their expensive computation still limits their real-world applications. To address this issue, recent efforts aim to compress the visual features to reduce the computational costs of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy the visual semantics in MLLMs, especially on difficult samples. To overcome this shortcoming, we propose a novel Dynamic Pyramid Network (DPN) for efficient MLLMs. Specifically, DPN formulates the MLLM as a hierarchical structure in which visual features are gradually compressed with increasing depth. As a result, even with a high compression ratio, fine-grained visual information can still be perceived in the shallow layers. To maximize the benefit of DPN, we further propose innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression rate according to the input features. With this design, harder samples are assigned more computation, thus preserving model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Moreover, the generalization ability of DPN is also validated on the existing high-resolution MLLM LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.
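The core idea of assigning a per-sample visual compression rate can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the averaging-based pooling, and the scalar difficulty-based gating rule are all hypothetical stand-ins for the learned Dynamic Pooling Experts.

```python
# Hedged sketch of dynamic pooling-rate selection (hypothetical names/logic,
# not the DPN/DPE implementation). Harder samples keep more visual tokens.

def average_pool(tokens, rate):
    """Compress a 1-D list of token features by averaging groups of `rate`."""
    pooled = []
    for i in range(0, len(tokens), rate):
        group = tokens[i:i + rate]
        pooled.append(sum(group) / len(group))
    return pooled

def dynamic_pool(tokens, difficulty, rates=(1, 2, 4)):
    """Pick a pooling rate based on a difficulty score in [0, 1]:
    higher difficulty -> smaller rate -> finer-grained tokens."""
    idx = min(int((1.0 - difficulty) * len(rates)), len(rates) - 1)
    rate = rates[idx]
    return average_pool(tokens, rate), rate

tokens = [float(i) for i in range(16)]
easy_out, easy_rate = dynamic_pool(tokens, difficulty=0.1)  # aggressive pooling
hard_out, hard_rate = dynamic_pool(tokens, difficulty=0.9)  # keeps all 16 tokens
```

In the actual method, the compression decision is made per layer of the pyramid by learned experts conditioned on the input features rather than by a hand-set scalar score.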