We address the video prediction task with a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE) and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By combining the intrinsic capability of HR-VQVAE to model still images with a parsimonious representation and the ability of ST-PixelCNN to handle spatiotemporal information, S-HR-VQVAE can better address the chief challenges in video prediction: learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques in both quantitative and qualitative evaluations, despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method that jointly estimates the HR-VQVAE and ST-PixelCNN parameters.