In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification and medical imaging. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data, such as collection and labeling costs, copyright, and privacy. We extend this trend to the video domain, applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on the established action recognition datasets HMDB51 and UCF101, as well as four other video benchmarks related to group action recognition, fine-grained action recognition, and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close, and are even superior on a portion of the downstream datasets. Code and samples of synthetic videos are available at https://github.com/davidsvy/fractal_video .
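To make the idea of fractal-based video generation concrete, the following is a minimal sketch of one common way to render fractals, the chaos game on a random affine iterated function system (IFS), extended to a short "clip" by smoothly drifting the affine parameters across frames. This is an illustrative assumption, not the paper's exact generation procedure; all function names here are hypothetical.

```python
import numpy as np

def random_ifs(n_maps=3, rng=None):
    """Sample a random affine IFS: each map is x -> A @ x + b,
    with A rescaled to be contractive so the orbit stays bounded."""
    rng = np.random.default_rng(rng)
    maps = []
    for _ in range(n_maps):
        A = rng.uniform(-1.0, 1.0, size=(2, 2))
        # Rescale so the spectral norm of A lies in (0.4, 0.8).
        A *= rng.uniform(0.4, 0.8) / max(np.linalg.norm(A, 2), 1e-8)
        b = rng.uniform(-1.0, 1.0, size=2)
        maps.append((A, b))
    return maps

def chaos_game(maps, n_points=20000, rng=None):
    """Approximate the IFS attractor by iterating randomly chosen maps."""
    rng = np.random.default_rng(rng)
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        A, b = maps[rng.integers(len(maps))]
        x = A @ x + b
        pts[i] = x
    return pts[100:]  # drop burn-in before the orbit reaches the attractor

def fractal_clip(n_frames=16, eps=0.01, seed=0):
    """A toy synthetic 'video': one fixed drift direction per map makes
    the attractor deform smoothly from frame to frame."""
    rng = np.random.default_rng(seed)
    maps = random_ifs(rng=rng)
    drift = [rng.standard_normal((2, 2)) for _ in maps]
    frames = []
    for t in range(n_frames):
        moved = [(A + eps * t * d, b) for (A, b), d in zip(maps, drift)]
        frames.append(chaos_game(moved, rng=rng))
    return frames  # list of (n_points - burn_in, 2) point clouds
```

Rasterizing each point cloud to an image would yield one frame of a clip; sampling many random IFS parameter sets yields the dataset-scale variety the abstract describes.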