MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation

Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Ainaz Eftekhar, Rohun Tripathi, Max Argus, Jordi Salvador, Haoquan Fang, Matthew Wallingford, Wilbert Pumacay, Yejin Kim, Quinn Pfeifer, Ying-Chun Lee, Piper Wolters, Omar Rayyan, Mingtong Zhang, Jiafei Duan, Karen Farley, Winson Han, Eli Vanderbilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Dhruv Shah, Ranjay Krishna

A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse synthetic training data, we show that zero-shot transfer to the real world is not only possible, but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the $π_0$ architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real-world evaluations across 4 settings, compared with 39.2% for $π_{0.5}$. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world.

Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
