Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at handling complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to improve VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA with visual trace prompting on our own collected dataset of 150K robot manipulation trajectories. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, while exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on the 4B Phi-3-Vision that, after pretraining on Open-X-Embodiment and finetuning on our dataset, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
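To make the idea concrete, here is a minimal sketch of what "encoding state-action trajectories visually" could look like, assuming the trace is a sequence of 2D pixel coordinates (e.g., end-effector positions over past timesteps) stamped onto the current camera frame before it is fed to the policy. The function name, marker style, and input format are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def overlay_visual_trace(image: np.ndarray, trace_xy: np.ndarray,
                         color=(255, 0, 0), radius: int = 2) -> np.ndarray:
    """Render a 2D state trajectory onto an RGB image as a visual prompt.

    image: (H, W, 3) uint8 array; trace_xy: (T, 2) pixel coordinates
    (hypothetical input; the paper's exact trace extraction differs).
    """
    out = image.copy()
    h, w = out.shape[:2]
    for x, y in trace_xy.astype(int):
        # Stamp a small square marker at each trajectory point, clipped
        # to the image bounds.
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        out[y0:y1, x0:x1] = color
    return out

# Example: mark a short diagonal trace on a blank frame before
# passing the prompted image to the VLA policy.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
trace = np.array([[10, 10], [20, 20], [30, 30]])
prompted = overlay_visual_trace(frame, trace)
```

In this sketch the model sees the recent motion history as in-image annotations rather than extra tokens, which is the intuition behind the visual prompting approach described above.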