Large language models and large multimodal models (LLMs and LMMs) deliver strong generative performance but suffer from slow decoding, a problem that becomes more severe with visual inputs, whose token sequences are typically much longer and of lower information density than text. Speculative decoding accelerates LLM inference by letting a compact draft model propose candidate tokens that are selectively accepted by a larger target model, achieving speed-up without degrading output quality. However, existing multimodal speculative decoding approaches largely ignore the structural characteristics of visual representations and usually rely on text-only draft models. In this paper, we introduce SpecFLASH, a speculative decoding framework tailored to LMMs that explicitly exploits multimodal structure in the design of the draft model. We first mitigate redundancy in visual token sequences with a lightweight, latent-guided token compression module that compacts visual features while preserving semantics, and then leverage the co-occurrence and local correlations of visual entities via a semi-autoregressive decoding scheme that predicts multiple tokens in a single forward pass. Extensive experiments demonstrate that SpecFLASH consistently surpasses prior speculative decoding baselines, achieving up to $2.68\times$ speed-up on video captioning and $2.55\times$ on visual instruction tuning, relative to the original LMM. Our code is available at https://github.com/ZihuaEvan/FlashSD/.
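The draft-and-verify loop underlying speculative decoding can be sketched as follows. This is a minimal illustration of the standard acceptance rule (keep each draft token with probability $\min(1, p_{\text{target}}/p_{\text{draft}})$, stopping at the first rejection); the function name and signature are illustrative, not taken from the SpecFLASH codebase.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng=None):
    """Accept a prefix of the draft model's proposed tokens.

    draft_probs[i]  : probability the draft model assigned to draft_tokens[i]
    target_probs[i] : probability the target model assigns to the same token

    Each token is kept with probability min(1, target/draft); verification
    stops at the first rejection, so only a prefix is ever accepted.
    """
    rng = rng or random.Random(0)
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # first rejection ends verification for this draft
    return accepted
```

Because the target model scores all drafted positions in one forward pass, every accepted prefix token costs a fraction of a full decoding step, which is the source of the speed-up the abstract reports.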