The rise of generalist robotic policies has created a rapidly growing demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, open-world images come without associated robot actions, which hinders their practical use for robot learning and leaves this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and to generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of the visuomotor data generated by IGen and show that policies trained solely on IGen-synthesized data achieve performance comparable to that of policies trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.
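To make the action representation concrete, the following is a minimal sketch of what a sequence of SE(3) end-effector poses can look like in code. It is not IGen's implementation: the function names (`quat_slerp`, `quat_to_matrix`, `interpolate_poses`), the waypoint values, and the choice of linear translation with spherical rotation interpolation are all assumptions made for illustration. Each pose is stored as a 4x4 homogeneous transform.

```python
import numpy as np

def quat_slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical rotations: fall back to normalized linear interpolation.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def quat_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def interpolate_poses(p0, q0, p1, q1, steps):
    """Return a (steps, 4, 4) array of SE(3) poses from waypoint 0 to waypoint 1."""
    poses = np.tile(np.eye(4), (steps, 1, 1))
    for i, t in enumerate(np.linspace(0.0, 1.0, steps)):
        poses[i, :3, :3] = quat_to_matrix(quat_slerp(q0, q1, t))
        poses[i, :3, 3] = (1 - t) * p0 + t * p1
    return poses

# Hypothetical waypoints: move the gripper while rotating 90 degrees about z.
p_start = np.array([0.3, 0.0, 0.4])
q_start = np.array([1.0, 0.0, 0.0, 0.0])
p_goal = np.array([0.5, 0.1, 0.2])
q_goal = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])

traj = interpolate_poses(p_start, q_start, p_goal, q_goal, steps=5)
```

Here `traj[i]` is the i-th end-effector pose. A planner producing high-level waypoints could hand a controller or renderer a dense sequence of this form, although the abstract does not specify how IGen densifies or parameterizes its pose sequences.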