Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. Their performance can be further improved by integrating action chunking, a critical technique for effective control. However, action chunking scales the action dimensions of VLA models linearly with the chunk size, which degrades inference efficiency. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly accelerating decoding. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that PD-VLA maintains competitive success rates while achieving a 2.52× higher execution frequency on a 7-degree-of-freedom manipulator than the base VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its applicability across different tasks.
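To make the core idea concrete, the sketch below illustrates greedy parallel fixed-point (Jacobi) decoding: all tokens of an action chunk are guessed at once and refined in parallel until the guess stops changing, at which point the result coincides with greedy autoregressive decoding. This is a minimal sketch under assumed interfaces, not the paper's implementation; `model` is assumed to be a HuggingFace-style causal LM, and `prompt_ids`, `num_action_tokens`, and `pad_id` are hypothetical names.

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prompt_ids, num_action_tokens, pad_id=0):
    """Sketch of parallel fixed-point (Jacobi) decoding for an action chunk.

    Instead of generating the chunk's tokens one by one, initialize a guess
    for all of them and update every position in parallel. The fixed point
    of this iteration equals the greedy autoregressive output.
    """
    device = prompt_ids.device
    # Initial guess for the whole action chunk (here: pad tokens).
    guess = torch.full((1, num_action_tokens), pad_id,
                       dtype=torch.long, device=device)

    # After k iterations the first k tokens provably match greedy AR
    # decoding, so at most num_action_tokens iterations are needed.
    for _ in range(num_action_tokens):
        # One forward pass scores every position of the guess in parallel.
        logits = model(torch.cat([prompt_ids, guess], dim=1)).logits
        # The token at position i is re-predicted from prompt + guess[:i];
        # logits at index j predict the token at index j + 1.
        start = prompt_ids.shape[1] - 1
        new_guess = logits[:, start:start + num_action_tokens, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
    return guess
```

In practice the iteration often converges in far fewer steps than the chunk length, and each iteration is a single batched forward pass, which is where the speedup over token-by-token decoding comes from.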