In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.
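To make the sequential formulation concrete, the following is a minimal inference-time sketch of pointer-style selection with a stop action. It is not the authors' implementation: the function name `pointer_select`, the projection `w_query`, the scalar `stop_bias`, and the residual-mean context update are all hypothetical choices made for illustration, and the differentiable relaxation used for training is omitted. The sketch only shows the two properties the abstract describes: each choice is conditioned on the tokens already selected, and the size of the retained subset is decided by the procedure rather than fixed in advance.

```python
import numpy as np


def pointer_select(tokens, w_query, stop_bias=0.0, max_steps=None):
    """Greedily pick token indices one at a time until a stop action wins.

    tokens:    (N, d) array of visual token features.
    w_query:   (d, d) projection mapping the running context to a query
               (hypothetical; stands in for a learned pointer head).
    stop_bias: scalar logit of the stop action (hypothetical; a learned
               termination head would produce this per step).
    """
    n, d = tokens.shape
    max_steps = n if max_steps is None else min(max_steps, n)
    selected = []
    available = np.ones(n, dtype=bool)
    context = tokens.mean(axis=0)  # initial context: average of all tokens

    for _ in range(max_steps):
        query = context @ w_query
        logits = tokens @ query / np.sqrt(d)
        logits[~available] = -np.inf  # a token can be chosen only once
        best = int(np.argmax(logits))
        # Stop once no remaining token beats the stop logit
        # (at least one token is always kept).
        if selected and logits[best] <= stop_bias:
            break
        selected.append(best)
        available[best] = False
        # Condition the next decision on what has been selected so far:
        # here, the part of the global average not yet covered (assumption).
        context = tokens.mean(axis=0) - tokens[selected].mean(axis=0)
    return selected


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
kept = pointer_select(tokens, np.eye(8), stop_bias=0.5)
print(len(kept), kept)
```

Because the stop action competes with the remaining tokens at every step, the number of retained tokens varies with the input, in contrast to a fixed top-k rule. In this toy version, raising `stop_bias` prunes more aggressively and lowering it keeps more tokens.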