We address the video prediction task with a model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE) and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging HR-VQVAE's intrinsic ability to model still images with a parsimonious representation, together with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction: learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ, Human3.6M, and Kitti, demonstrate that our model compares favorably against state-of-the-art video prediction techniques in both quantitative and qualitative evaluations, despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method that jointly estimates the HR-VQVAE and AST-PM parameters.
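To make the phrase "hierarchical residual learning vector quantized" more concrete, the following is a minimal sketch of generic layer-wise residual vector quantization, in which each layer encodes what the previous layers failed to capture. It is not the HR-VQVAE architecture itself, which the abstract does not specify: the function names `nearest` and `residual_quantize`, the toy codebooks, and the input vector are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x layer by layer; each layer encodes the residual left so far."""
    residual = list(x)          # what remains to be explained
    recon = [0.0] * len(x)      # running reconstruction
    indices: List[int] = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Add this layer's codeword to the reconstruction.
        recon = [r + c for r, c in zip(recon, codebook[i])]
        # The next layer sees only the error that is still unexplained.
        residual = [xi - ri for xi, ri in zip(x, recon)]
    return indices, recon


# Hypothetical toy codebooks: a coarse layer followed by a finer one.
codebooks = [
    [(0.0, 0.0), (4.0, 4.0)],
    [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)],
]
indices, recon = residual_quantize((3.0, 3.5), codebooks)
print(indices, recon)
```

In this toy setting the coarse layer selects the codeword `(4.0, 4.0)` and the finer layer corrects it with `(-1.0, -1.0)`, so a short list of discrete indices stands in for the original vector. This is one plausible reading of how a hierarchy of residual codebooks can yield a parsimonious representation.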