Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.
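
The abstract names two ingredients, dual-source importance estimation and a Combine-then-Filter selection, but does not give the scoring or filtering rules. The following is therefore only an illustrative sketch under stated assumptions, not the authors' implementation: temporal smoothing is assumed to be an exponential moving average, "Combine" is assumed to be the union of the top-scoring tokens under each criterion, and "Filter" is assumed to be a greedy cosine-similarity redundancy check. The names `ema_update` and `combine_then_filter` and the `decay` parameter are hypothetical.

```python
import numpy as np


def ema_update(prev, current, decay=0.8):
    """Temporally smooth per-token action attention (assumed: a plain EMA)."""
    current = np.asarray(current, dtype=float)
    if prev is None:
        return current.copy()
    return decay * prev + (1.0 - decay) * current


def combine_then_filter(semantic, action, feats, budget):
    """Return sorted indices of `budget` visual tokens to keep.

    semantic: (N,) prefill-stage importance scores
    action:   (N,) smoothed action-decode importance scores
    feats:    (N, D) visual token features, used only to measure redundancy
    """
    semantic = np.asarray(semantic, dtype=float)
    action = np.asarray(action, dtype=float)
    feats = np.asarray(feats, dtype=float)
    budget = min(budget, semantic.shape[0])

    # Combine (assumption): union of the top-`budget` tokens per criterion.
    top_sem = np.argsort(-semantic)[:budget]
    top_act = np.argsort(-action)[:budget]
    pool = np.union1d(top_sem, top_act)

    # Filter (assumption): greedily add the candidate that is least similar
    # to the tokens already chosen, so near-duplicates are dropped first.
    unit = feats[pool] / (np.linalg.norm(feats[pool], axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    rank = np.maximum(semantic[pool], action[pool])
    chosen = [int(np.argmax(rank))]
    while len(chosen) < budget:
        redundancy = sim[:, chosen].max(axis=1)
        redundancy[chosen] = np.inf  # never re-pick a chosen token
        chosen.append(int(np.argmin(redundancy)))
    return np.sort(pool[chosen])


# Toy example: tokens 0 and 1 are semantically salient near-duplicates,
# tokens 4 and 5 matter for the action decoder.
semantic = [0.9, 0.8, 0.1, 0.1, 0.0, 0.0]
action = [0.0, 0.0, 0.1, 0.1, 0.9, 0.8]
feats = [[1, 0], [1, 0.01], [1, 1], [1, 1], [0, 1], [0.5, 1]]
kept = combine_then_filter(semantic, action, feats, budget=2)
```

In this toy case, ranking by the semantic score alone would keep the two near-duplicate tokens 0 and 1, whereas the sketch keeps one semantically salient token and one action-relevant token, which is the kind of mismatch the abstract describes.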
