Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
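The abstract does not specify the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, applied to embeddings of two frames drawn from the same video. The function name `info_nce`, the `temperature` value, and the use of in-batch negatives are assumptions made for illustration. They are not taken from VITO.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss (illustrative; not VITO's actual objective).

    z_a[i] and z_b[i] are embeddings of two frames from the same video
    (a positive pair). All other rows in the batch act as negatives.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the (assumed) temperature.
    logits = z_a @ z_b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))

# Toy usage: correctly paired frames should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
frames_t0 = rng.normal(size=(8, 16))
frames_t1 = frames_t0 + 0.01 * rng.normal(size=(8, 16))  # stand-in for a later frame
aligned_loss = info_nce(frames_t0, frames_t1)
shuffled_loss = info_nce(frames_t0, frames_t1[::-1])
```

In this toy setup the later frame is simulated as a slightly perturbed copy of the earlier one. In real video the two views would differ through motion, deformation, and viewpoint change, which is the kind of natural transformation the abstract argues is informative.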