Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents deployment on edge devices and makes understanding high-resolution images and long videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity in the language model decoder without using any quadratic-complexity modules, while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the language model decoder with an M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (the Flex-MA branch) with localized Swin-style window attention for correlations among adjacent tokens (the Local-Swin branch). To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with the self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them with LoRA adapters, while regressing on the hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of its teacher models while reducing Time-To-First-Token (TTFT) by up to 2.7$\times$ and improving token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of both branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, opening avenues for long-context VLMs that handle high-resolution images and long videos.
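The abstract does not give the internal equations of the M-MATE block, so the following is a minimal PyTorch sketch of a dual-branch block of this shape, assuming a simplified diagonal bidirectional state-space scan as a stand-in for the Flex-MA branch, non-overlapping 1D window attention as the Local-Swin branch, and a learned gated sum as the fusion rule. All module and parameter names (`MMATEBlock`, `window_size`, the gate) are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a dual-branch "M-MATE"-style block. The SSM and the
# fusion rule are simplified stand-ins, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalSSM(nn.Module):
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * x_t, run in both directions."""

    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay via sigmoid
        self.b = nn.Parameter(torch.ones(dim))       # per-channel input gain
        self.out = nn.Linear(dim, dim)

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.log_a)                # (dim,), decay in (0, 1)
        h = torch.zeros_like(x[:, 0])                # (batch, dim)
        states = []
        for t in range(x.shape[1]):                  # one pass: O(seq_len)
            h = a * h + self.b * x[:, t]
            states.append(h)
        return torch.stack(states, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd = self.scan(x)                           # left-to-right context
        bwd = self.scan(x.flip(1)).flip(1)           # right-to-left context
        return self.out(fwd + bwd)


class WindowAttention(nn.Module):
    """Softmax attention restricted to non-overlapping 1D windows."""

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads, self.window_size = num_heads, window_size
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        w = self.window_size
        pad = (-n) % w                       # pad so windows tile the sequence
        x = F.pad(x, (0, 0, 0, pad))         # (padding left unmasked for brevity)
        nw = x.shape[1] // w
        qkv = self.qkv(x).reshape(b, nw, w, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(3, 0, 1, 4, 2, 5).unbind(0)   # (b, nw, heads, w, hd)
        out = F.scaled_dot_product_attention(q, k, v)       # attention per window
        out = out.permute(0, 1, 3, 2, 4).reshape(b, nw * w, d)
        return self.proj(out[:, :n])


class MMATEBlock(nn.Module):
    """Dual branch: global SSM context + local window attention, gated sum."""

    def __init__(self, dim: int, num_heads: int = 8, window_size: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.flex_ma = BidirectionalSSM(dim)             # "Flex-MA" stand-in
        self.local_swin = WindowAttention(dim, num_heads, window_size)
        self.gate = nn.Parameter(torch.zeros(dim))       # fusion rule is assumed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        g = torch.sigmoid(self.gate)
        return x + g * self.flex_ma(h) + (1 - g) * self.local_swin(h)


if __name__ == "__main__":
    block = MMATEBlock(dim=256)
    tokens = torch.randn(2, 200, 256)    # seq_len need not divide window_size
    print(block(tokens).shape)           # torch.Size([2, 200, 256])
```

The scan costs O(seq_len) per channel and the window attention O(seq_len $\cdot$ window_size), so the block contains no term quadratic in sequence length, which is the property the abstract claims for the decoder.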
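In the same spirit, here is a minimal sketch of the three-stage schedule and the distillation objective, assuming MSE for the hidden-state regression and temperature-scaled KL divergence for the token-level logit matching against the frozen teacher. The helper names (`distill_loss`, `set_stage`), the attribute names on the student, the `lora_` parameter-name convention, and the loss weights are all assumptions for illustration.

```python
# Sketch of the three-stage distillation schedule; losses and names assumed.
import torch
import torch.nn.functional as F


def distill_loss(student_hidden, teacher_hidden, student_logits, teacher_logits,
                 alpha: float = 1.0, beta: float = 1.0, tau: float = 2.0):
    """Regress the student onto the frozen teacher's hidden states and logits."""
    # Hidden-state regression (assumed to be MSE).
    hidden = F.mse_loss(student_hidden, teacher_hidden)
    # Token-level logit matching (assumed to be temperature-scaled KL).
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        log_target=True, reduction="batchmean",
    ) * tau ** 2
    return alpha * hidden + beta * kl


def set_stage(student, stage: int):
    """Freeze/unfreeze student parameters according to the distillation stage."""
    for p in student.parameters():
        p.requires_grad = False
    for block in student.mmate_blocks:                  # assumed attribute name
        for p in block.flex_ma.parameters():            # stage (i): Flex-MA alone
            p.requires_grad = True
        if stage >= 2:                                  # stage (ii): + Local-Swin
            for p in block.local_swin.parameters():
                p.requires_grad = True
    if stage >= 3:                                      # stage (iii): LoRA adapters
        for name, p in student.named_parameters():      # on the remaining blocks
            if "lora_" in name:
                p.requires_grad = True


if __name__ == "__main__":
    s_h, t_h = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
    s_l, t_l = torch.randn(2, 10, 32000), torch.randn(2, 10, 32000)
    print(distill_loss(s_h, t_h, s_l, t_l).item())
```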