Predicting future video frames is challenging because it is difficult to learn the uncertainty of the underlying factors that influence their content. In this paper, we propose a novel video prediction model with infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video into motion and content information, then use a neural stochastic differential equation to predict the temporal motion information, and finally use an image diffusion model to generate the video frames autoregressively, conditioning on the predicted motion feature and the previous frame. The greater expressiveness and stronger stochasticity learning capability of our model lead to state-of-the-art video prediction performance. In addition, our model achieves temporally continuous prediction, i.e., it predicts, in an unsupervised way, future video frames at an arbitrarily high frame rate. Our code is available at \url{https://github.com/XiYe20/STDiffProject}.
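To illustrate why a stochastic differential equation over the motion feature permits prediction at an arbitrary frame rate, the following is a minimal sketch of Euler-Maruyama integration of $dm = f(m,t)\,dt + g(m,t)\,dW$, queried on two different time grids. It is not the released implementation: the functions `drift` and `diffusion`, the weights `W_f` and `W_g`, the feature dimension, and the `substeps` parameter are all hypothetical stand-ins for the learned networks, and the image diffusion model that would turn each motion feature into a frame is omitted.

```python
import numpy as np


def euler_maruyama(m0, drift, diffusion, times, rng, substeps=10):
    """Integrate dm = drift(m, t) dt + diffusion(m, t) dW.

    Returns the state at every requested time in `times` (increasing order).
    """
    m = np.asarray(m0, dtype=float).copy()
    t = float(times[0])
    states = [m.copy()]
    for t_next in times[1:]:
        dt = (float(t_next) - t) / substeps
        for _ in range(substeps):
            # Brownian increment with variance dt
            dw = rng.normal(0.0, np.sqrt(dt), size=m.shape)
            m = m + drift(m, t) * dt + diffusion(m, t) * dw
            t += dt
        t = float(t_next)  # avoid accumulating floating-point drift in t
        states.append(m.copy())
    return np.stack(states)


# Hypothetical stand-ins for the learned drift and diffusion networks.
rng = np.random.default_rng(0)
dim = 8  # assumed motion-feature dimension, for illustration only
W_f = rng.normal(scale=0.3, size=(dim, dim))
W_g = rng.normal(scale=0.1, size=(dim, dim))


def drift(m, t):
    return np.tanh(W_f @ m)


def diffusion(m, t):
    # Elementwise (diagonal) noise scale
    return 0.1 * np.tanh(W_g @ m)


m0 = rng.normal(size=dim)  # motion feature extracted from observed frames

# The same SDE can be queried on a coarse or a fine time grid.
coarse_times = np.linspace(0.0, 1.0, 5)
fine_times = np.linspace(0.0, 1.0, 17)
coarse = euler_maruyama(m0, drift, diffusion, coarse_times, rng)
fine = euler_maruyama(m0, drift, diffusion, fine_times, rng)
print(coarse.shape, fine.shape)
```

Because the time grid is an argument of the solver rather than a property of the model, a denser grid yields motion features for intermediate timestamps without retraining; each sampled path differs because of the Brownian increments, which is the source of stochasticity in the predictions.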