Pruning is a common technique for accelerating compute-bound models by removing computation on unimportant values. Recently, it has been applied to speed up Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to success rate drops of over 20% and limited speedup in some scenarios. In this paper, we identify spatial-temporal consistency in VLA tasks: input images in consecutive steps are highly similar. This yields the key insight that token selection should combine local information with the model's global context. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning: we leverage global history and local attention to statically reduce the number of visual tokens per action. (2) Layer-level dynamic pruning: we prune tokens adaptively at each layer based on layer-wise importance. (3) Lightweight action-aware control: we classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57$\times$ speedup in LIBERO simulation and 1.70$\times$ on real-world tasks, with negligible success rate degradation.
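The combined token-selection and action-aware control described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the blending weight `alpha`, the speed threshold, and the keep ratios are all assumed placeholder values.

```python
import numpy as np

def select_tokens(local_attn, global_importance, keep_ratio, alpha=0.5):
    """Rank visual tokens by a blend of per-step (local) attention and
    history-based (global) importance, keeping the top fraction.
    `alpha` weights local vs. global evidence (illustrative value)."""
    score = alpha * local_attn + (1.0 - alpha) * global_importance
    k = max(1, int(len(score) * keep_ratio))
    # Indices of the k highest-scoring tokens.
    return np.argsort(score)[::-1][:k]

def keep_ratio_for_action(ee_speed, speed_thresh=0.05,
                          coarse_ratio=0.3, fine_ratio=0.7):
    """Action-aware control: a fast-moving end effector suggests a
    coarse-grained action, so prune aggressively (keep fewer tokens);
    slow, precise motion suggests a fine-grained action, so be
    conservative. Thresholds here are hypothetical."""
    return coarse_ratio if ee_speed > speed_thresh else fine_ratio

# Example: tokens 0 and 2 score highest, so they survive a 50% keep ratio.
kept = select_tokens(np.array([0.9, 0.1, 0.5, 0.2]),
                     np.array([0.8, 0.2, 0.4, 0.1]),
                     keep_ratio=keep_ratio_for_action(ee_speed=0.01))
```

In this sketch, the global importance vector would be carried over from the previous action step, exploiting the spatial-temporal consistency of consecutive input images.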