The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.
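To make the action format described above concrete, below is a minimal sketch, not the paper's implementation, of how a low-level action could be encoded as an SE(3) end-effector pose sequence: each pose is a 4x4 homogeneous transform, and intermediate waypoints are produced by linearly interpolating translation while slerping rotation. The helper names `make_pose` and `interpolate_pose_sequence`, and the grasp-approach waypoints in the usage example, are hypothetical and chosen for illustration only.

```python
# Illustrative sketch of an SE(3) end-effector pose sequence (assumed
# encoding; IGen's actual action representation may differ in detail).
import numpy as np
from scipy.spatial.transform import Rotation, Slerp


def make_pose(rotation: Rotation, translation: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 homogeneous SE(3) transform from rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = rotation.as_matrix()
    T[:3, 3] = translation
    return T


def interpolate_pose_sequence(T_start: np.ndarray,
                              T_end: np.ndarray,
                              num_steps: int = 10) -> list:
    """Return num_steps SE(3) poses from T_start to T_end, inclusive.

    Translation is linearly interpolated; rotation is spherically
    interpolated (slerp) so the orientation path stays on SO(3).
    """
    rots = Rotation.from_matrix(np.stack([T_start[:3, :3], T_end[:3, :3]]))
    slerp = Slerp([0.0, 1.0], rots)
    ts = np.linspace(0.0, 1.0, num_steps)
    translations = (1 - ts)[:, None] * T_start[:3, 3] + ts[:, None] * T_end[:3, 3]
    return [make_pose(slerp(t), p) for t, p in zip(ts, translations)]


if __name__ == "__main__":
    # Hypothetical grasp approach: descend 10 cm while rotating 90 deg about z.
    T0 = make_pose(Rotation.identity(), np.array([0.4, 0.0, 0.3]))
    T1 = make_pose(Rotation.from_euler("z", 90, degrees=True),
                   np.array([0.4, 0.0, 0.2]))
    for T in interpolate_pose_sequence(T0, T1, num_steps=5):
        print(np.round(T, 3))
```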