Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalization to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion planners.