Motivated by the intuitive understanding humans have of the space of possible interactions, and by the ease with which they generalize this understanding to previously unseen scenes, we develop an approach for learning visual affordances to guide robot exploration. Given an input image of a scene, we infer a distribution over plausible future states that can be reached by interacting with it. We use a Transformer-based model to learn this conditional distribution in the latent embedding space of a VQ-VAE. We show that these models can be trained on large-scale, diverse passive data, and that the learned models exhibit compositional generalization to diverse objects beyond the training distribution. Finally, we show how the trained affordance model can guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation.
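To make the goal-sampling idea concrete, the following is a minimal sketch of drawing goal candidates as sequences of discrete latent codes, conditioned on the codes of the current scene. It is not the authors' implementation: the codebook size, the sequence length, and the function `toy_next_token_logits` (a deterministic stand-in for a trained Transformer) are all assumptions introduced for illustration, and the VQ-VAE encoder and decoder are omitted.

```python
import math
import random

VOCAB_SIZE = 512   # assumed codebook size (hypothetical)
NUM_TOKENS = 16    # assumed 4x4 grid of latent codes (hypothetical)


def toy_next_token_logits(scene_codes, goal_prefix):
    """Hypothetical stand-in for a trained Transformer.

    Returns unnormalized scores over the codebook for the next goal token.
    A real model would attend over scene_codes and goal_prefix; here we
    only derive pseudo-random scores from them so the sketch runs.
    """
    seed = hash((tuple(scene_codes), tuple(goal_prefix)))
    local = random.Random(seed)
    return [local.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]


def sample_goal_codes(scene_codes, rng, temperature=1.0):
    """Ancestral sampling: draw goal tokens one at a time from the
    conditional distribution given the scene codes and earlier tokens."""
    goal = []
    for _ in range(NUM_TOKENS):
        logits = toy_next_token_logits(scene_codes, goal)
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp((x - peak) / temperature) for x in logits]
        token = rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]
        goal.append(token)
    return goal


rng = random.Random(0)
# Pretend these codes came from encoding the current scene image.
scene_codes = [rng.randrange(VOCAB_SIZE) for _ in range(NUM_TOKENS)]
# Each sample is one candidate goal in latent-code form.
goal_candidates = [sample_goal_codes(scene_codes, rng) for _ in range(3)]
```

In a full pipeline, each sampled code sequence would be decoded into a goal image and handed to the goal-conditioned policy as an exploration target; that step is left out here because it depends on trained networks.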