Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially when real-robot data is limited. To address this generalization bottleneck, we introduce \our{}, a hierarchical Vision-Language-Action framework that leverages the generalization ability of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition (VISTA). \our{} pairs a world model as the high-level planner with a VLA as the low-level executor. The high-level world model first decomposes a manipulation task into a sequence of subtasks, each paired with a synthesized goal image; the low-level policy then follows this textual and visual guidance to generate action sequences. Compared with raw textual goal specifications, the synthesized goal images provide visually and physically grounded detail for the low-level policy, enabling generalization to unseen objects and novel scenes. We validate both visual goal synthesis and our hierarchical VLA policy in extensive out-of-distribution scenarios: with the guidance generated by the world model, the performance of a same-structured VLA in novel scenarios improves from 14% to 69%. Results show that our method outperforms prior baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: \href{https://vista-wm.github.io/}{https://vista-wm.github.io}