Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

Perceptual understanding of a scene and of the relationships between its components is important for the successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most current methods learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task, which are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object-agnostic, and we demonstrate that the resulting representations are insufficient for general-purpose robotics tasks because they fail to capture the complexity of scenes with many components. In this paper, we explore the effectiveness of object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20 percent increase in performance in low-data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines on the task of object localization in multi-object scenes.
