Deploying robots at scale demands robustness to the long tail of everyday situations. The variations in scene layout, object geometry, and task specification that characterize real environments are vast, yet they are underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show that MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm that newer and stronger zero-shot policies outperform earlier versions on our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.
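The sim-to-real correlation reported above can be illustrated with a minimal sketch. It assumes that R denotes the Pearson correlation coefficient and ρ the Spearman rank correlation, which the abstract does not state explicitly. The function names and the success rates below are hypothetical and chosen only for illustration; they are not values or code from MolmoSpaces.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical per-policy success rates (illustrative only, NOT from the paper):
# one entry per policy, evaluated once in simulation and once on hardware.
sim_success = [0.10, 0.25, 0.40, 0.55, 0.80]
real_success = [0.05, 0.30, 0.35, 0.60, 0.75]

print("Pearson R:", pearson_r(sim_success, real_success))
print("Spearman rho:", spearman_rho(sim_success, real_success))
```

Under these assumptions, a high R would indicate that simulated success rates track real ones roughly linearly, while a high ρ would indicate that the benchmark ranks policies in nearly the same order as physical evaluation does.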