Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: https://pointbridge3d.github.io/