When humans perform a task with an articulated object, they interact with it in only a handful of ways, even though the space of all possible interactions is nearly endless. This is because humans have prior knowledge about which interactions are likely to succeed; for example, to open an unfamiliar door, we first try the handle. While learning such priors without supervision is easy for humans, it is notoriously hard for machines. In this work, we tackle unsupervised learning of priors over useful interactions with articulated objects, which we call interaction modes. In contrast to prior art, we use no supervision or privileged information; we only assume access to a depth sensor in the simulator to learn the interaction modes. More precisely, we define a successful interaction as one that substantially changes the visual environment, and we learn a generative model of such interactions that can be conditioned on the desired goal state of the object. In our experiments, we show that our model covers most human interaction modes, outperforms existing state-of-the-art methods for affordance learning, and generalizes to objects never seen during training. Additionally, we show promising results in the goal-conditional setup, where our model can be quickly fine-tuned to perform a given task. The experiments further show that the learned affordances predict interactions covering most interaction modes of the queried articulated object. Supplementary material: https://actaim.github.io.
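The success criterion above ("substantially changes the visual environment") can be illustrated with a minimal sketch that compares depth images taken before and after an interaction. The abstract does not specify the actual criterion, so the function names `changed_fraction` and `is_successful`, the per-pixel tolerance, and the minimum changed fraction are all hypothetical choices made for illustration only.

```python
from typing import Sequence

# A depth image is assumed here to be a row-major grid of depths in metres.
DepthImage = Sequence[Sequence[float]]


def changed_fraction(before: DepthImage, after: DepthImage,
                     pixel_tol: float = 0.01) -> float:
    """Return the fraction of pixels whose depth moved by more than pixel_tol."""
    total = 0
    changed = 0
    for row_before, row_after in zip(before, after):
        for d_before, d_after in zip(row_before, row_after):
            total += 1
            if abs(d_after - d_before) > pixel_tol:
                changed += 1
    return changed / total if total else 0.0


def is_successful(before: DepthImage, after: DepthImage,
                  pixel_tol: float = 0.01, min_fraction: float = 0.05) -> bool:
    """Hypothetical label: success if enough of the depth image changed.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return changed_fraction(before, after, pixel_tol) >= min_fraction


# Toy 4x4 scene: the interaction moves two pixels 20 cm closer to the camera.
before = [[1.0] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[1][2] = 0.8
after[2][2] = 0.8

fraction = changed_fraction(before, after)  # 2 of 16 pixels changed
label = is_successful(before, after)
unchanged_label = is_successful(before, before)
```

Labels of this kind need only the depth sensor, which matches the stated assumption of no supervision or privileged simulator state. A generative model could then be trained on the interactions labelled successful, though how the paper does so is not described in this abstract.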