Visual pre-training on large-scale real-world data has made great progress in recent years and shows great potential for robot learning from pixel observations. However, recipes for visual pre-training for robot manipulation tasks have yet to be established. In this paper, we thoroughly investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives: pre-training datasets, model architectures, and training methods. We report several significant experimental findings that are beneficial for robot learning. Further, we propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning. Concretely, the former employs contrastive learning to acquire underlying patterns from large-scale unlabeled data, while the latter aims to learn visual semantics and temporal dynamics. Extensive experiments on robot manipulation tasks in various simulation environments and on a real robot demonstrate the superiority of the proposed scheme. Videos and more details can be found at \url{https://explore-pretrain-robot.github.io}.