Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but they suffer from slow inference. Speculative Decoding (SD) is a promising acceleration method that falls into two categories: drafter-based SD and retrieval-based SD. Existing methods do not analyze the advantages and disadvantages of these two types of SD in VLA models, so each type has been applied or optimized only in isolation. In this paper, we analyze the trajectory patterns of robots controlled by a VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses two challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. HeiSD includes a retrieval-based SD optimization method that combines a verify-skip mechanism with a sequence-wise relaxed acceptance strategy. Moreover, HeiSD uses a kinematic-based fused metric to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x to 2.41x in real-world scenarios, while sustaining a high task success rate.
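To make the idea of relaxed acceptance concrete, the following is a minimal sketch of one speculative verification step over discretized action tokens. It is not the HeiSD algorithm: the abstract does not specify the sequence-wise strategy, so this sketch substitutes a simple per-token tolerance in bin indices. The function names, the `tol` parameter, and the assumption that action tokens are integer bin indices are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int], tol: int = 0) -> int:
    """Count leading draft tokens that lie within `tol` bins of the verifier's tokens.

    Assumption: tokens are integer indices of discretized action bins, so a
    small index difference corresponds to a small difference in the action.
    """
    n = 0
    for d, t in zip(draft, target):
        if abs(d - t) > tol:
            break  # first rejected position ends the accepted prefix
        n += 1
    return n


def speculative_step(draft: Sequence[int], target: Sequence[int], tol: int = 0) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step.

    `target[i]` stands for the verifier's prediction at position i given the
    draft prefix before i (obtained in a single forward pass in practice).
    Accepted draft tokens are kept, and the verifier's token at the first
    rejected position (or the bonus position) is appended.
    """
    n = accepted_prefix_len(draft, target, tol)
    out = list(draft[:n])
    if n < len(target):
        out.append(target[n])  # verifier's correction or bonus token
    return out


if __name__ == "__main__":
    draft = [10, 20, 30, 40]        # hypothetical drafted or retrieved tokens
    target = [10, 21, 35, 40, 50]   # hypothetical verifier predictions
    print(speculative_step(draft, target, tol=0))
    print(speculative_step(draft, target, tol=1))
```

With `tol=0` this reduces to exact-match greedy verification; raising `tol` accepts longer draft prefixes at the cost of small deviations from the verifier's output, which is the general trade-off a relaxed acceptance strategy exploits.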