Robot manipulation relying on learned object-centric descriptors has become popular in recent years. Visual descriptors can easily describe manipulation task objectives, they can be learned efficiently using self-supervision, and they can encode actuated and even non-rigid objects. However, learning robust, view-invariant keypoints in a self-supervised manner requires a meticulous data collection pipeline involving precise camera calibration and expert supervision. In this paper we introduce the Cycle-Correspondence Loss (CCL) for view-invariant dense descriptor learning, which adopts the concept of cycle-consistency, enabling a simple data collection pipeline and training on unpaired RGB camera views. The key idea is to autonomously detect valid pixel correspondences: a pixel's predicted correspondence in a new image is used to predict back the original pixel in the original image, and error terms are scaled based on the estimated confidence of each correspondence. Our evaluation shows that we outperform other self-supervised RGB-only methods and approach the performance of supervised methods, both on keypoint tracking and on a robot grasping downstream task.
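The abstract states the cycle idea only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of one plausible instantiation: match a pixel's descriptor from view A into view B via soft-argmax, map the matched descriptor back into view A, and penalize the cycle error, down-weighted by a confidence estimate. The soft-argmax matching, bilinear descriptor sampling, and exponential confidence weighting are our assumptions for illustration, not necessarily the paper's exact formulation, and names like `cycle_correspondence_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_correspondence(desc, desc_map, temperature=0.01):
    """Soft-argmax match of a single descriptor against a dense map.

    desc:     (C,) query descriptor.
    desc_map: (C, H, W) dense descriptor map.
    Returns the expected (x, y) match location under the match distribution.
    """
    C, H, W = desc_map.shape
    sims = (desc_map * desc[:, None, None]).sum(0)          # (H, W) similarities
    probs = F.softmax(sims.flatten() / temperature, dim=0)  # match distribution
    ys = torch.arange(H, dtype=probs.dtype).repeat_interleave(W)
    xs = torch.arange(W, dtype=probs.dtype).repeat(H)
    return torch.stack([(probs * xs).sum(), (probs * ys).sum()])  # expected (x, y)

def cycle_correspondence_loss(desc_a, desc_b, pixels_a, sigma=2.0):
    """Cycle error for pixels sampled in view A, weighted by confidence.

    desc_a, desc_b: (C, H, W) descriptor maps of two unpaired views.
    pixels_a:       iterable of integer (x, y) pixel coordinates in view A.
    """
    C, H, W = desc_a.shape
    losses = []
    for x, y in pixels_a:
        d_p = desc_a[:, y, x]                    # descriptor at original pixel p
        q = soft_correspondence(d_p, desc_b)     # predicted match in view B
        # Bilinearly sample the descriptor at the sub-pixel match location q.
        grid = torch.stack([2 * q[0] / (W - 1) - 1,
                            2 * q[1] / (H - 1) - 1]).view(1, 1, 1, 2)
        d_q = F.grid_sample(desc_b[None], grid, align_corners=True).view(C)
        p_back = soft_correspondence(d_q, desc_a)  # map back into view A
        p = torch.tensor([float(x), float(y)])
        cycle_err = (p_back - p).pow(2).sum()
        # Confidence weight (illustrative choice): down-weight pixels whose
        # cycle closes poorly, treating them as likely-invalid correspondences.
        w = torch.exp(-cycle_err.detach() / (2 * sigma ** 2))
        losses.append(w * cycle_err)
    return torch.stack(losses).mean()
```

In this sketch, only correspondences whose cycle nearly closes contribute strongly to the loss, mirroring the abstract's idea of autonomously filtering invalid matches without ground-truth pixel pairs; the confidence weight is detached so that the filtering itself does not receive gradients.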