Large foundation models have shown strong open-world generalization on complex problems in vision and language, but robotics has yet to achieve similar levels of generalization. One fundamental challenge is that these models exhibit limited zero-shot capability, which hampers their ability to generalize to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that exploits the generalization of foundation models more effectively, enabling zero-shot manipulation and automatic data generation for robotics. In particular, we study a class of hierarchical VLA models in which the high-level ASM (Affordance Segmentation Module) is finetuned to perceive image keypoint affordances of the scene, and the mid-level 3DAgent carries out task understanding, skill knowledge retrieval, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. This intermediate 3D path prediction then serves as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstrations, making it far more scalable across diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations train more robust behavior cloning policies than human demonstrations or data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be a scalable method both for generating robotics data and for solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.
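The three-level hierarchy described above (high-level affordance keypoints, mid-level 3D path planning, low-level path-conditioned control) can be sketched as a minimal pipeline. This is an illustrative sketch only, not the paper's implementation: every class, method, and parameter name here is a hypothetical placeholder, and the keypoint detection, 3D lifting, and control logic are stubbed out with trivial stand-ins.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class AffordanceSegmentationModule:
    """Placeholder for the finetuned high-level ASM.

    A real module would segment the image and return affordance keypoints;
    here we return fixed pixel coordinates for illustration.
    """

    def keypoints(self, image: Any) -> List[Tuple[int, int]]:
        return [(32, 48), (120, 64)]


@dataclass
class TrajectoryAgent:
    """Placeholder for the mid-level 3DAgent.

    A real agent would combine task understanding and skill knowledge;
    here we naively lift pixel keypoints to 3D at a fixed depth.
    """

    def plan(self, keypoints: List[Tuple[int, int]], depth: float = 0.5) -> List[Point3D]:
        return [(u / 100.0, v / 100.0, depth) for u, v in keypoints]


@dataclass
class LowLevelPolicy:
    """Placeholder for the 3D-aware low-level controller.

    Emits delta actions between consecutive waypoints of the planned path.
    """

    def actions(self, path: List[Point3D]) -> List[Point3D]:
        return [
            (b[0] - a[0], b[1] - a[1], b[2] - a[2])
            for a, b in zip(path, path[1:])
        ]


def run_pipeline(image: Any) -> List[Point3D]:
    """Chain the three levels: keypoints -> 3D path -> end-effector actions."""
    kps = AffordanceSegmentationModule().keypoints(image)
    path = TrajectoryAgent().plan(kps)
    return LowLevelPolicy().actions(path)
```

The key design point the sketch mirrors is that each level communicates through a narrow, interpretable interface (image keypoints, then a 3D path), so the high and mid levels can lean on foundation-model generalization while only the low-level policy needs precise 3D control.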