Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents deployment on edge devices and makes understanding high-resolution images and long videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity in the language model decoder without using any quadratic-complexity modules, while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the language model decoder with an M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (the Flex-MA branch) with localized Swin-style window attention for correlations among adjacent tokens (the Local-Swin branch). To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with the self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them with LoRA adapters, while regressing on the hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of its teacher models while reducing Time-To-First-Token (TTFT) by up to 2.7$\times$ and improving token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of both branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, opening avenues for long-context VLMs that handle high-resolution images and long videos.
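The abstract does not give the internal equations of the M-MATE block, so the following is a minimal PyTorch sketch of a dual-branch block of this shape, assuming a simplified diagonal bidirectional state-space scan as a stand-in for the Flex-MA branch, non-overlapping 1D window attention as the Local-Swin branch, and a learned gated sum as the fusion rule. All module and parameter names (`MMATEBlock`, `window_size`, the gate) are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a dual-branch "M-MATE"-style block. The SSM and the
# fusion rule are simplified stand-ins, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalSSM(nn.Module):
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * x_t, run in both directions."""

    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay via sigmoid
        self.b = nn.Parameter(torch.ones(dim))       # per-channel input gain
        self.out = nn.Linear(dim, dim)

    def scan(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.log_a)                # (dim,), decay in (0, 1)
        h = torch.zeros_like(x[:, 0])                # (batch, dim)
        states = []
        for t in range(x.shape[1]):                  # one pass: O(seq_len)
            h = a * h + self.b * x[:, t]
            states.append(h)
        return torch.stack(states, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd = self.scan(x)                           # left-to-right context
        bwd = self.scan(x.flip(1)).flip(1)           # right-to-left context
        return self.out(fwd + bwd)


class WindowAttention(nn.Module):
    """Softmax attention restricted to non-overlapping 1D windows."""

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads, self.window_size = num_heads, window_size
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        w = self.window_size
        pad = (-n) % w                       # pad so windows tile the sequence
        x = F.pad(x, (0, 0, 0, pad))         # (padding left unmasked for brevity)
        nw = x.shape[1] // w
        qkv = self.qkv(x).reshape(b, nw, w, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(3, 0, 1, 4, 2, 5).unbind(0)   # (b, nw, heads, w, hd)
        out = F.scaled_dot_product_attention(q, k, v)       # attention per window
        out = out.permute(0, 1, 3, 2, 4).reshape(b, nw * w, d)
        return self.proj(out[:, :n])


class MMATEBlock(nn.Module):
    """Dual branch: global SSM context + local window attention, gated sum."""

    def __init__(self, dim: int, num_heads: int = 8, window_size: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.flex_ma = BidirectionalSSM(dim)             # "Flex-MA" stand-in
        self.local_swin = WindowAttention(dim, num_heads, window_size)
        self.gate = nn.Parameter(torch.zeros(dim))       # fusion rule is assumed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        g = torch.sigmoid(self.gate)
        return x + g * self.flex_ma(h) + (1 - g) * self.local_swin(h)


if __name__ == "__main__":
    block = MMATEBlock(dim=256)
    tokens = torch.randn(2, 200, 256)    # seq_len need not divide window_size
    print(block(tokens).shape)           # torch.Size([2, 200, 256])
```

The scan costs O(seq_len) per channel and the window attention O(seq_len $\cdot$ window_size), so the block contains no term quadratic in sequence length, which is the property the abstract claims for the decoder.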
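In the same spirit, here is a minimal sketch of the three-stage schedule and the distillation objective, assuming MSE for the hidden-state regression and temperature-scaled KL divergence for the token-level logit matching against the frozen teacher. The helper names (`distill_loss`, `set_stage`), the attribute names on the student, the `lora_` parameter-name convention, and the loss weights are all assumptions for illustration.

```python
# Sketch of the three-stage distillation schedule; losses and names assumed.
import torch
import torch.nn.functional as F


def distill_loss(student_hidden, teacher_hidden, student_logits, teacher_logits,
                 alpha: float = 1.0, beta: float = 1.0, tau: float = 2.0):
    """Regress the student onto the frozen teacher's hidden states and logits."""
    # Hidden-state regression (assumed to be MSE).
    hidden = F.mse_loss(student_hidden, teacher_hidden)
    # Token-level logit matching (assumed to be temperature-scaled KL).
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        log_target=True, reduction="batchmean",
    ) * tau ** 2
    return alpha * hidden + beta * kl


def set_stage(student, stage: int):
    """Freeze/unfreeze student parameters according to the distillation stage."""
    for p in student.parameters():
        p.requires_grad = False
    for block in student.mmate_blocks:                  # assumed attribute name
        for p in block.flex_ma.parameters():            # stage (i): Flex-MA alone
            p.requires_grad = True
        if stage >= 2:                                  # stage (ii): + Local-Swin
            for p in block.local_swin.parameters():
                p.requires_grad = True
    if stage >= 3:                                      # stage (iii): LoRA adapters
        for name, p in student.named_parameters():      # on the remaining blocks
            if "lora_" in name:
                p.requires_grad = True


if __name__ == "__main__":
    s_h, t_h = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
    s_l, t_l = torch.randn(2, 10, 32000), torch.randn(2, 10, 32000)
    print(distill_loss(s_h, t_h, s_l, t_l).item())
```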