Although speculative decoding is widely used to accelerate inference in Vision-Language Models (VLMs), it suffers severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into attention dilution and negative visual gain, caused by key-value cache explosion and context-window mismatch. We observe a visual-semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, rendering the raw visual inputs structurally redundant in deep inference. To address this, we propose the Sparrow framework, which first applies visually-aware, text-anchored window attention with hidden-state reuse to offload visual computation entirely to the target model, and then leverages intermediate-layer visual-state bridging to train the draft model on semantically rich intermediate states, filtering out low-level visual noise. In addition, a multi-token prediction strategy is introduced to mitigate the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long-video tasks.
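To make the underlying mechanism concrete, below is a minimal sketch of the greedy draft-then-verify loop that speculative decoding builds on. The two toy "models" and all function names are hypothetical stand-ins for illustration, not Sparrow's actual Vid-LLM components: a cheap draft model proposes k tokens, and the target model verifies them in one parallel pass, accepting the longest agreeing prefix.

```python
# Hypothetical toy models: deterministic next-token functions over int ids.
def target_model(prefix):
    # Toy target: next token = (last + 1) mod 7.
    return (prefix[-1] + 1) % 7

def draft_model(prefix):
    # Toy draft: agrees with the target except after token 3, where it errs.
    return (prefix[-1] + 1) % 7 if prefix[-1] != 3 else 0

def speculative_decode(prompt, num_new, k=4):
    """Generate num_new tokens; return (sequence, target verification passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) < len(prompt) + num_new:
        # 1) Draft proposes up to k tokens autoregressively (cheap).
        drafts, ctx = [], list(seq)
        for _ in range(k):
            t = draft_model(ctx)
            drafts.append(t)
            ctx.append(t)
        # 2) Target scores all k positions in one parallel forward pass
        #    (simulated here sequentially, but counted as a single pass).
        target_passes += 1
        accepted, ctx = 0, list(seq)
        for t in drafts:
            if target_model(ctx) != t:
                break  # first disagreement: reject the rest of the draft
            ctx.append(t)
            accepted += 1
        seq.extend(drafts[:accepted])
        # 3) The target emits one token itself: the correction on a
        #    mismatch, or a free bonus token when all k were accepted.
        seq.append(target_model(seq))
    return seq[:len(prompt) + num_new], target_passes
```

With the toy models above, generating 8 tokens from prompt `[1]` needs only 2 target verification passes instead of 8 sequential target calls; the speedup in practice depends on the draft's acceptance rate, which is exactly what collapses for Vid-LLMs under long visual contexts.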