Robotic motor control requires the ability to predict the dynamics of environments and interaction objects. However, current self-supervised pre-trained visual representations (PVRs) for robotic motor control, which leverage large-scale egocentric videos, often focus solely on learning static content features from sampled image frames. This neglects the crucial temporal motion cues in human video data, which implicitly encode key knowledge about sequentially interacting with and manipulating environments and objects. In this paper, we present a simple yet effective visual pre-training framework for robotic motor control that jointly performs spatiotemporal predictive learning on large-scale video data, termed STP. STP samples paired frames from video clips and follows two key designs in a multi-task learning manner. First, we perform spatial prediction on the masked current frame to learn content features. Second, we use the future frame with an extremely high masking ratio as a condition, together with the masked current frame, to conduct temporal prediction of the future frame and thereby capture motion features. These efficient designs ensure that our representation focuses on motion information while still capturing spatial details. We carry out the largest-scale evaluation of PVRs for robotic motor control to date, encompassing 21 tasks on a real-world Franka robot arm and in 5 simulated environments. Extensive experiments demonstrate the effectiveness of STP and further unleash its generality and data efficiency through post-pre-training and hybrid pre-training.
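To make the two objectives concrete, below is a minimal PyTorch sketch of the dual-task masked prediction described above. The architecture (a shared Transformer encoder with two linear reconstruction heads), the BERT-style mask-token scheme, and the specific mask ratios (0.75 for the current frame, 0.95 for the future-frame condition) are illustrative assumptions for exposition, not the paper's exact design.

```python
import torch
import torch.nn as nn


def mask_tokens(tok, mask_token, ratio):
    """Replace a random subset of patch tokens with a learned [MASK] embedding."""
    B, N, D = tok.shape
    mask = torch.rand(B, N, device=tok.device) < ratio        # True = masked
    out = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tok)
    return out, mask


class STPSketch(nn.Module):
    # Hypothetical module illustrating the multi-task objective; positional
    # embeddings and other ViT details are omitted for brevity.
    def __init__(self, embed_dim=256, patch_dim=16 * 16 * 3):
        super().__init__()
        self.patchify = nn.Linear(patch_dim, embed_dim)       # stand-in for a ViT patch stem
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.spatial_head = nn.Linear(embed_dim, patch_dim)   # reconstruct current frame
        self.temporal_head = nn.Linear(embed_dim, patch_dim)  # predict future frame

    def forward(self, cur, fut):
        # cur, fut: (B, N, patch_dim) patchified current / future frames.
        cur_tok, cur_mask = mask_tokens(self.patchify(cur), self.mask_token, ratio=0.75)
        # The future frame is masked at an extremely high ratio and serves
        # only as a sparse condition for the temporal prediction task.
        fut_tok, _ = mask_tokens(self.patchify(fut), self.mask_token, ratio=0.95)
        z = self.encoder(torch.cat([cur_tok, fut_tok], dim=1))
        z_cur, z_fut = z[:, : cur.shape[1]], z[:, cur.shape[1]:]
        # Task 1: spatial prediction -- reconstruct masked current-frame patches.
        loss_s = nn.functional.mse_loss(self.spatial_head(z_cur)[cur_mask], cur[cur_mask])
        # Task 2: temporal prediction -- predict the full future frame.
        loss_t = nn.functional.mse_loss(self.temporal_head(z_fut), fut)
        return loss_s + loss_t


# Usage on dummy patch tensors (B=2 clips, N=196 patches of 16x16x3 pixels):
model = STPSketch()
loss = model(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
loss.backward()
```

In this sketch, both frames are encoded jointly so the current-frame tokens can attend to the sparse future condition; the original work may realize the conditioning differently (e.g., through a separate decoder), which this sketch does not claim to reproduce.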