In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating any natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g., motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets increases steadily along the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the gap on UCF101 action classification between training from scratch and self-supervised pre-training on natural videos, and outperforms the natural-video pre-trained model on HMDB51. Adding crops of static natural images to the pre-training stage yields performance comparable to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 of the 14 out-of-distribution datasets in UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.
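To make the idea of "simple generative processes" concrete, below is a minimal illustrative sketch of how such a synthetic clip might be generated: a single primitive shape with random velocity, acceleration, and a slow radius change. All function names, parameters, and distributions here are our own assumptions for exposition, not the paper's actual data-generation code.

```python
# Illustrative sketch only: a toy generative process for synthetic video
# clips in the spirit of the progression described above (motion,
# acceleration, and a simple shape transformation). Hypothetical code.
import numpy as np

def render_frame(size, center, radius):
    """Rasterize one filled circle on a blank grayscale frame."""
    yy, xx = np.mgrid[0:size, 0:size]
    mask = (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2
    frame = np.zeros((size, size), dtype=np.float32)
    frame[mask] = 1.0
    return frame

def synthesize_clip(num_frames=16, size=64, seed=0):
    """Generate one clip: a circle with random initial velocity,
    constant acceleration, and a slowly changing radius."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(size * 0.25, size * 0.75, size=2)  # initial position
    vel = rng.uniform(-2.0, 2.0, size=2)                 # motion
    acc = rng.uniform(-0.2, 0.2, size=2)                 # acceleration
    radius = rng.uniform(4.0, 10.0)
    dr = rng.uniform(-0.3, 0.3)                          # shape transformation
    frames = []
    for _ in range(num_frames):
        frames.append(render_frame(size, pos, max(radius, 1.0)))
        vel += acc
        pos += vel
        radius += dr
    return np.stack(frames)  # shape: (num_frames, size, size)

clip = synthesize_clip()
print(clip.shape)  # (16, 64, 64)
```

A dataset progression of the kind described above could then be obtained by enabling these properties one at a time (static shapes, then motion, then acceleration, then shape transformations).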