Video diffusion models can generate high-quality videos by learning strong spatial-temporal priors from large-scale datasets. In this paper, we investigate whether such priors, derived from a generative process, are suitable for video recognition and, ultimately, for the joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process to learn generalized spatial-temporal representations. The resulting framework naturally supports both generation and recognition and, more importantly, remains robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, with 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also achieves the best class-conditioned image-to-video generation results, with FVD scores of 46.5 and 49.3 on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed.
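To make the random-frame conditioning process concrete, the sketch below shows one plausible way to construct such conditioning inputs: for each training clip, a random subset of frames is kept visible while the rest are zeroed out, so a single model sees everything from a single-frame (generation-like) setting to fully observed (recognition-like) inputs. This is a minimal illustration, not the authors' implementation; the tensor shapes, the zero-masking convention, and the function name `random_frame_condition` are all assumptions.

```python
# Minimal sketch of random-frame conditioning (illustrative, not GenRec's code).
import torch

def random_frame_condition(latents: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """latents: (B, T, C, H, W) video latents.
    Returns masked conditioning latents and the binary frame mask applied."""
    B, T = latents.shape[:2]
    # Sample how many frames to reveal per clip: anywhere from 1 to all T.
    num_visible = torch.randint(1, T + 1, (B,))
    mask = torch.zeros(B, T)
    for b in range(B):
        keep = torch.randperm(T)[: num_visible[b]]  # random frame indices to keep
        mask[b, keep] = 1.0
    # Zero out the latents of masked frames; the diffusion backbone is then
    # trained to reconstruct the full clip from this partial observation.
    cond = latents * mask.view(B, T, 1, 1, 1)
    return cond, mask

# Usage: cond, mask = random_frame_condition(torch.randn(2, 8, 4, 32, 32))
```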