Vision-Language-Action (VLA) models have become the mainstream solution for robot control but suffer from slow inference. Speculative decoding (SD) is a promising acceleration method that falls into two categories: drafter-based SD and retrieval-based SD. When applied to VLA models, the two categories show complementary advantages and limitations, which suggests that a hybrid approach combining them could yield better performance. In this paper, we first conduct a series of detailed analyses that reveal the advantages and feasibility of hybrid use. Even with these insights, implementing hybrid SD in VLA models presents two challenges: (1) draft rejection and persistent errors in retrieval-based SD; and (2) difficulty in determining the hybrid boundary. To address them, we propose the HeiSD framework. HeiSD includes an optimization method for retrieval-based SD, consisting of a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. It also introduces a kinematic-based fused metric that automatically determines the hybrid boundary. Experimental results show that HeiSD achieves a speedup of up to 2.45x on simulation benchmarks and 2.06x to 2.41x in real-world scenarios, while maintaining a high task success rate.