Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or a few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel 3D displacements in response to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation conditions directly on the physical geometry of the robot while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With real-time (0.1 s) inference, PointWorld can be efficiently integrated into a model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training, all from a single image captured in the wild. Project website: https://point-world.github.io/.
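The MPC integration described above can be illustrated with a minimal random-shooting sketch. This is not the authors' implementation: `point_world` below is a hypothetical stand-in for the pre-trained model (here a dummy linear response), and all function names, shapes, and hyperparameters are assumptions chosen only to show the interface — forecast per-pixel 3D point flows for candidate action sequences, then pick the sequence whose predicted flow best matches a goal flow.

```python
import numpy as np

def point_world(rgbd, actions):
    """Hypothetical stand-in for the pre-trained world model.

    Given an RGB-D image (H, W, 4) and an action sequence (T, action_dim),
    returns per-pixel 3D displacements (T, H, W, 3). The real model is a
    learned network; this dummy version just accumulates the translational
    part of each action so the sketch is runnable.
    """
    H, W = rgbd.shape[:2]
    T = actions.shape[0]
    per_step = np.broadcast_to(actions[:, None, None, :3], (T, H, W, 3))
    return 1e-3 * np.cumsum(per_step, axis=0)

def mpc_plan(rgbd, goal_flow, horizon=8, n_samples=64, action_dim=7, seed=0):
    """Random-shooting MPC: sample action sequences, forecast point flows
    with the world model, and return the lowest-cost sequence."""
    rng = np.random.default_rng(seed)
    candidates = rng.normal(size=(n_samples, horizon, action_dim))
    costs = []
    for actions in candidates:
        final_flow = point_world(rgbd, actions)[-1]   # flow at the last step
        costs.append(np.mean((final_flow - goal_flow) ** 2))
    return candidates[int(np.argmin(costs))]

best_actions = mpc_plan(np.zeros((48, 64, 4)), np.zeros((48, 64, 3)))
```

In a receding-horizon loop, only the first action of `best_actions` would be executed before re-planning from a fresh observation; the 0.1 s inference speed cited above is what makes such closed-loop re-planning practical.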