Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. Across both single-task and multitask settings, it achieves gains of up to 44% in zero-shot sim-to-real transfer and up to 66% with limited real data. Videos of robot executions are available at: https://pointbridge3d.github.io/
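To make the general pattern concrete, below is a minimal sketch of the kind of pipeline the abstract describes: task-relevant 2D keypoints (e.g., returned by a VLM query on an image) are lifted into domain-agnostic 3D points using depth and camera intrinsics, and a small transformer policy consumes the resulting point set. All names and design details here (`lift_keypoints_to_3d`, `PointPolicy`, the pooling choice, the action dimension) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a point-based sim-to-real observation pipeline.
# The VLM keypoint-extraction step is elided; we assume it yields (N, 2)
# pixel coordinates of task-relevant points.
import torch
import torch.nn as nn


def lift_keypoints_to_3d(keypoints_2d, depth, intrinsics):
    """Back-project 2D pixel keypoints into camera-frame 3D points.

    keypoints_2d: (N, 2) pixel coordinates (u, v)
    depth:        (H, W) depth map in meters
    intrinsics:   (3, 3) camera intrinsic matrix
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    u, v = keypoints_2d[:, 0], keypoints_2d[:, 1]
    z = depth[v.long(), u.long()]          # per-keypoint depth lookup
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)  # (N, 3)


class PointPolicy(nn.Module):
    """Transformer encoder over a set of task-relevant 3D points.

    Because the input is a set of 3D coordinates with no appearance
    features, the same policy can consume points extracted from either
    simulated or real observations.
    """

    def __init__(self, action_dim=7, d_model=128, n_layers=4):
        super().__init__()
        self.point_embed = nn.Linear(3, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, points):               # points: (B, N, 3)
        tokens = self.point_embed(points)
        encoded = self.encoder(tokens)       # (B, N, d_model)
        pooled = encoded.mean(dim=1)         # permutation-invariant pooling
        return self.action_head(pooled)      # (B, action_dim)
```

The mean pooling reflects one plausible reading of why a point-based representation is domain-agnostic: a set of 3D coordinates carries no canonical ordering and no texture or lighting information, so the policy sees the same input distribution whether the points were extracted from rendered or real images.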