Vision-Language-Action (VLA) models establish a token-domain paradigm for robot control, but their inference is slow. Speculative Decoding (SD) is an optimization strategy that can speed up inference. Two key issues emerge when combining VLA with SD: first, SD relies on re-inference to correct token errors, which is computationally expensive; second, the acceptance threshold in SD must be tuned carefully to limit those errors. Existing works do not address either issue effectively. Meanwhile, although embodied intelligence serves as the bridge between AI and the physical world, existing work in the area has overlooked robot kinematics. To address these issues, we combine token-domain VLA models with kinematic-domain prediction for SD and propose a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy that dynamically rectifies the acceptance threshold, removing the need to fix it by hand. Experimental results across diverse tasks and environments show that KERV achieves 27% to 37% acceleration with almost no loss in Success Rate.
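To make the compensation idea concrete, the following is a minimal sketch, not KERV's actual implementation. It assumes a single action dimension, a constant-velocity motion model, and a simple distance test for deciding when to replace a drafted value; the abstract specifies none of these. All names (`ConstantVelocityKF`, `forecast_next`, `rectify_action`, `tolerance`) are hypothetical.

```python
# Illustrative sketch only: a 1-D constant-velocity Kalman filter that
# forecasts the next action value from past actions. The motion model,
# noise variances, and the rectification rule are assumptions made for
# illustration; they are not taken from the KERV paper.

class ConstantVelocityKF:
    def __init__(self, position, dt=1.0, process_var=1e-4, meas_var=1e-2):
        self.p = float(position)  # estimated position (action value)
        self.v = 0.0              # estimated velocity
        self.dt = dt
        self.q = process_var      # process noise added to the diagonal
        self.r = meas_var         # measurement noise variance
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance

    def predict(self):
        """Advance the state by one step and return the predicted position."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with an observed position z."""
        (a, b), (c, d) = self.P
        s = a + self.r             # innovation variance (H = [1, 0])
        k0, k1 = a / s, c / s      # Kalman gain
        y = z - self.p             # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


def forecast_next(history):
    """Run the filter over past action values and forecast the next one."""
    kf = ConstantVelocityKF(history[0])
    for z in history[1:]:
        kf.predict()
        kf.update(z)
    return kf.predict()


def rectify_action(draft, predicted, tolerance):
    """Hypothetical rule: keep the drafted value if it is close to the
    kinematic forecast, otherwise fall back to the forecast instead of
    re-running the large model."""
    return draft if abs(draft - predicted) <= tolerance else predicted


if __name__ == "__main__":
    past = [0.1 * k for k in range(20)]  # a joint moving at constant speed
    guess = forecast_next(past)
    print(rectify_action(draft=5.0, predicted=guess, tolerance=0.1))
```

In this sketch the fallback costs a handful of arithmetic operations per action dimension, which is the intuition for why a kinematic predictor can be cheaper than re-inference. A scheme that adapts `tolerance` over time, in the spirit of the dynamic threshold described above, would slot into `rectify_action`.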