In this paper, we show how to run a pi0-level multi-view VLA at a 30 Hz frame rate and a trajectory frequency of up to 480 Hz on a single consumer GPU. This enables dynamic, real-time tasks that were previously believed to be out of reach for large VLA models. To achieve this, we introduce a bag of strategies that eliminate overheads in model inference. A real-world experiment shows that the pi0 policy with our strategies achieves a 100% success rate on the task of grasping a falling pen. Based on these results, we further propose a full streaming inference framework for real-time robot control with VLAs. Code is available at https://github.com/Dexmal/realtime-vla.