Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially when real-robot data is limited. To address this generalization bottleneck, we introduce \our{}, a hierarchical Vision-Language-Action framework that leverages the generalization ability of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition (VISTA). \our{} pairs a world model as the high-level planner with a VLA as the low-level executor. The high-level world model first decomposes a manipulation task into a sequence of subtasks, each paired with a synthesized goal image; the low-level policy then follows this textual and visual guidance to generate action sequences. Compared with raw textual goal specifications, the synthesized goal images provide visually and physically grounded detail for the low-level policy, enabling generalization to unseen objects and novel scenes. We validate both visual goal synthesis and our hierarchical VLA policy in extensive out-of-distribution scenarios: with the guidance generated by the world model, the performance of a same-structured VLA in novel scenarios improves from 14% to 69%. Results show that our method outperforms prior baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: \href{https://vista-wm.github.io/}{https://vista-wm.github.io}