Self-supervised learning of image representations by predicting future frames is a promising direction, but it remains challenging because frame prediction is under-determined: multiple potential futures can arise from a single current frame. To tackle this challenge, in this paper, we revisit the idea of stochastic video generation, which learns to capture the uncertainty in frame prediction, and explore its effectiveness for representation learning. Specifically, we design a framework that trains a stochastic frame prediction model to learn temporal information between frames. Moreover, to learn dense information within each frame, we introduce an auxiliary masked image modeling objective along with a shared decoder architecture. We find that this architecture combines the two objectives in a synergistic and compute-efficient manner. We demonstrate the effectiveness of our framework on a variety of tasks from video label propagation and vision-based robot learning domains, such as video segmentation, pose tracking, and vision-based robotic locomotion and manipulation. Code is available on the project webpage: https://sites.google.com/view/2024rsp.
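To make the recipe concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the combined objective the abstract describes: a conditional-VAE-style stochastic frame prediction loss plus an auxiliary masked image modeling loss, with a single decoder shared between the two. All module and variable names (RSPSketch, z_proj, mask_token), the network sizes, the mask ratio, the KL weight, and the loss weight lam are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the two-objective training described in the abstract.
# Assumptions (not from the paper): all architecture sizes, the cVAE-style
# latent parameterization, mask ratio, and loss weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSPSketch(nn.Module):
    def __init__(self, dim=256, latent_dim=32, patch=8, img=64):
        super().__init__()
        self.patch, self.img = patch, img
        # Patch embedding shared by both objectives.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Posterior q(z | x_t, x_{t+1}) and prior p(z | x_t) over the
        # stochastic latent that captures uncertainty about the future.
        self.posterior = nn.Linear(2 * dim, 2 * latent_dim)
        self.prior = nn.Linear(dim, 2 * latent_dim)
        self.z_proj = nn.Linear(latent_dim, dim)
        # Shared decoder: decodes the future frame from (x_t, z) and the
        # masked current frame from visible tokens.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True),
            num_layers=2)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def tokens(self, x):
        return self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)

    def pixels(self, tok):
        B = tok.shape[0]
        h = self.img // self.patch
        out = self.to_pixels(tok).view(B, h, h, 3, self.patch, self.patch)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, self.img, self.img)

    def forward(self, x_t, x_next, mask_ratio=0.75):
        t, nxt = self.tokens(x_t), self.tokens(x_next)
        # --- stochastic frame prediction (cVAE-style) ---
        mu_q, logv_q = self.posterior(
            torch.cat([t.mean(1), nxt.mean(1)], -1)).chunk(2, -1)
        mu_p, logv_p = self.prior(t.mean(1)).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logv_q).exp()
        pred = self.pixels(self.decoder(t + self.z_proj(z).unsqueeze(1)))
        # KL( q(z|x_t, x_{t+1}) || p(z|x_t) ) for two diagonal Gaussians.
        kl = 0.5 * (logv_p - logv_q
                    + (logv_q.exp() + (mu_q - mu_p) ** 2) / logv_p.exp()
                    - 1).sum(-1).mean()
        loss_pred = F.mse_loss(pred, x_next) + 1e-4 * kl
        # --- auxiliary masked image modeling through the SAME decoder ---
        keep = torch.rand_like(t[..., 0]) > mask_ratio  # (B, N) visibility
        masked = torch.where(keep.unsqueeze(-1), t, self.mask_token)
        recon = self.pixels(self.decoder(masked))
        # For simplicity the loss covers all pixels; MAE-style training
        # would restrict it to the masked patches only.
        loss_mim = F.mse_loss(recon, x_t)
        lam = 0.5  # assumed weighting between the two objectives
        return loss_pred + lam * loss_mim

if __name__ == "__main__":
    model = RSPSketch()
    loss = model(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
    loss.backward()
```

One reading of the abstract's "synergistic and compute-efficient" claim is visible in this sketch: both the future-frame tokens and the masked current-frame tokens pass through the same decoder, so the auxiliary objective adds little extra compute beyond a second decoder pass.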