The vision-language-action (VLA) paradigm has enabled powerful robotic control by leveraging vision-language models, but its reliance on large-scale, high-quality robot data limits generalization. Generative world models offer a promising alternative for general-purpose embodied AI, yet a critical gap remains between their pixel-level plans and physically executable actions. To bridge this gap, we propose the Tool-Centric Inverse Dynamics Model (TC-IDM). By focusing on the tool's imagined trajectory as synthesized by the world model, TC-IDM establishes a robust intermediate representation that connects visual planning to physical control. TC-IDM extracts the tool's point-cloud trajectories from generated videos via segmentation and 3D motion estimation. To accommodate diverse tool attributes, our architecture employs decoupled action heads that project these planned trajectories into 6-DoF end-effector motions and corresponding control signals. This plan-and-translate paradigm not only supports a wide range of end-effectors but also substantially improves viewpoint invariance. Furthermore, it generalizes well to long-horizon and out-of-distribution tasks, including interaction with deformable objects. In real-world evaluations, the world model equipped with TC-IDM achieves an average success rate of 61.11%, with 77.7% on simple tasks and 38.46% on zero-shot deformable-object tasks, substantially outperforming end-to-end VLA-style baselines and other inverse dynamics models.
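The plan-and-translate pipeline described above can be sketched in minimal form. This is an illustrative toy, not the paper's implementation: `segment_tool` stands in for a learned segmenter (here a brightness threshold), depth is assumed given, the pose head recovers only translational 6-DoF deltas from tool centroids, and the control head proxies a gripper signal by point-cloud spread. All function names and the unit-intrinsics back-projection are hypothetical.

```python
import numpy as np

def segment_tool(frame, threshold=0.5):
    # Hypothetical stand-in for a learned tool segmenter:
    # treat bright pixels as the tool.
    return frame > threshold

def lift_to_point_cloud(mask, depth):
    # Back-project masked pixels to 3D using per-pixel depth
    # (pinhole camera with identity intrinsics, for illustration only).
    ys, xs = np.nonzero(mask)
    zs = depth[ys, xs]
    return np.stack([xs * zs, ys * zs, zs], axis=1)  # (N, 3)

def pose_head(traj):
    # Decoupled head 1: map per-frame tool centroids to 6-DoF deltas.
    # Translation comes from centroid motion; rotation is zeroed in
    # this sketch (a real head would regress it from point geometry).
    centroids = np.array([pc.mean(axis=0) for pc in traj])
    translations = np.diff(centroids, axis=0)
    rotations = np.zeros_like(translations)
    return np.concatenate([translations, rotations], axis=1)  # (T-1, 6)

def control_head(traj):
    # Decoupled head 2: emit a scalar control signal per frame,
    # here crudely proxied by the spread of the tool point cloud.
    return np.array([pc.std() for pc in traj])  # (T,)

def tc_idm_sketch(frames, depths):
    # Plan-and-translate: imagined video -> tool point-cloud
    # trajectory -> end-effector motions + control signals.
    traj = [lift_to_point_cloud(segment_tool(f), d)
            for f, d in zip(frames, depths)]
    return pose_head(traj), control_head(traj)
```

Keeping the two heads decoupled, as in the abstract, lets the same trajectory representation drive different end-effectors: only the control head needs swapping when the tool's actuation mode changes.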