Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused its evaluations on semantics-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M all the way up to the largest reported self-supervised video model by far – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
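As a rough illustration of the pretraining objective named above, the following minimal sketch shows the masking step of video MAE: the video is split into spatio-temporal "tubelet" patches, most patches are randomly hidden, and only the visible ones are fed to the encoder while the decoder is trained to reconstruct the rest. The tubelet size, mask ratio, and function names here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def patchify(video: np.ndarray, tubelet=(2, 16, 16)) -> np.ndarray:
    """Split a (T, H, W, C) video into flattened non-overlapping tubelet patches."""
    T, H, W, C = video.shape
    t, h, w = tubelet
    assert T % t == 0 and H % h == 0 and W % w == 0
    patches = video.reshape(T // t, t, H // h, h, W // w, w, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)  # group each tubelet together
    return patches.reshape(-1, t * h * w * C)         # (num_patches, patch_dim)

def random_mask(num_patches: int, mask_ratio=0.9, rng=None):
    """Return indices of visible and masked patches at the given mask ratio."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(num_patches)
    num_visible = int(num_patches * (1 - mask_ratio))
    return perm[:num_visible], perm[num_visible:]

# Illustrative 16-frame clip; sizes chosen only so the shapes divide evenly.
video = np.random.rand(16, 224, 224, 3).astype(np.float32)
patches = patchify(video)
visible_idx, masked_idx = random_mask(len(patches))
# The encoder sees only patches[visible_idx]; the decoder predicts
# patches[masked_idx], trained with an MSE reconstruction loss.
print(patches.shape, len(visible_idx), len(masked_idx))
```

Because the encoder processes only the small visible subset (10% of patches at a 0.9 mask ratio in this sketch), this style of objective keeps pretraining cost manageable even as the model grows, which is what makes scaling to very large video transformers practical.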