Vision-Language-Action (VLA) models establish a token-domain paradigm for robot control, but they suffer from slow inference. Speculative Decoding (SD) is an optimization strategy that can speed up inference. Two key issues emerge when SD is applied to VLA models: first, SD relies on re-inference to correct token errors, which is computationally expensive; second, keeping token errors low requires careful tuning of the SD acceptance threshold. Existing work does not address either issue effectively. Meanwhile, although embodied intelligence serves as the bridge between AI and the physical world, existing approaches have overlooked robot kinematics. To address these issues, we combine token-domain VLA models with kinematic-domain prediction for SD and propose a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy that dynamically rectifies the acceptance threshold, removing the need to tune it by hand. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27% to 37% acceleration with almost no loss in success rate.
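To make the compensation idea concrete, the following is a minimal sketch and not the KERV implementation. It assumes a one-dimensional action (for example, a single end-effector coordinate) and a constant-velocity motion model, and it treats the accept/reject decision as given. The names `ConstantVelocityKF` and `compensate`, the noise parameters `q` and `r`, and the time step `dt` are all hypothetical and chosen only for illustration. When a draft action is rejected, the filter's prediction is used in place of a re-inference pass; the dynamic threshold adjustment is not modeled here.

```python
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter (illustrative assumption)."""

    def __init__(self, pos=0.0, vel=0.0, dt=1.0, q=1e-3, r=1e-2):
        self.pos, self.vel = pos, vel      # state: position and velocity
        self.dt, self.q, self.r = dt, q, r  # step, process noise, measurement noise
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance

    def predict(self):
        """Propagate the state one step: P <- F P F^T + q*I, F = [[1, dt], [0, 1]]."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.pos += dt * self.vel
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.pos

    def update(self, z):
        """Fuse a position measurement z (observation matrix H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain
        y = z - self.pos               # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


def compensate(kf, draft_action, accepted):
    """Return the action to execute for this step.

    Accepted drafts are executed and fed back into the filter.
    Rejected drafts are replaced by the kinematic prediction,
    standing in for the re-inference that plain SD would perform.
    """
    predicted = kf.predict()
    if accepted:
        kf.update(draft_action)
        return draft_action
    return predicted


# Usage: nine accepted steps along a straight line, then one rejected draft.
kf = ConstantVelocityKF()
for k in range(1, 10):
    compensate(kf, 0.1 * k, accepted=True)
fallback = compensate(kf, draft_action=5.0, accepted=False)
print(fallback)  # kinematic prediction used instead of the rejected draft
```

In this sketch the filter is updated only with accepted actions, so a rejected outlier never contaminates the kinematic state. A multi-dimensional action would need one such filter per dimension or a joint state vector, which is omitted here for brevity.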