Video prediction aims to predict future frames from a video's previous content. In video data, the time dimension mingles with the space and channel dimensions, and existing methods mainly handle it from one of three perspectives: as a sequence of individual frames, as a 3D volume in spatiotemporal coordinates, or as a stacked image in which frames are treated as separate channels. Most methods adopt only one of these perspectives and may therefore fail to fully exploit the relationships across dimensions. To address this issue, this paper introduces a convolutional mixer for video prediction, termed ViP-Mixer, which models spatiotemporal evolution in the latent space of an autoencoder. ViP-Mixer blocks are stacked sequentially and interleave feature mixing at three levels: frames, channels, and locations. Extensive experiments demonstrate that the proposed method achieves new state-of-the-art prediction performance on three benchmark video datasets covering both synthetic and real-world scenarios.
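The three perspectives on video data and the three mixing levels can be illustrated with a small NumPy sketch. It is not the ViP-Mixer implementation: the abstract describes a convolutional mixer operating in an autoencoder's latent space, whereas this toy uses plain dense linear mixing on a raw tensor purely to show which axis each level acts on. The tensor sizes, the `mix` helper, and the random weights are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy clip: T frames, C channels, H x W pixels.
T, C, H, W = 4, 3, 8, 8
video = np.arange(T * C * H * W, dtype=np.float32).reshape(T, C, H, W)

# View 1: a sequence of individual frames, each of shape (C, H, W).
frames = [video[t] for t in range(T)]

# View 2: a 3D spatiotemporal volume with layout (C, T, H, W).
volume = video.transpose(1, 0, 2, 3)

# View 3: a stacked image whose channels hold every frame's channels.
stacked = video.reshape(T * C, H, W)


def mix(x, weight, axis):
    """Linearly mix features along one axis of x (illustrative helper)."""
    moved = np.moveaxis(x, axis, -1)   # bring the mixed axis to the end
    mixed = moved @ weight.T           # (..., n) @ (n, n) -> (..., n)
    return np.moveaxis(mixed, -1, axis)


# Random square weights stand in for learned parameters (assumption).
rng = np.random.default_rng(0)
w_frame = rng.standard_normal((T, T)).astype(np.float32)
w_chan = rng.standard_normal((C, C)).astype(np.float32)
w_loc = rng.standard_normal((H * W, H * W)).astype(np.float32)

# Interleave mixing at the three levels: frames, channels, locations.
out = mix(video, w_frame, axis=0)                    # across frames
out = mix(out, w_chan, axis=1)                       # across channels
out = mix(out.reshape(T, C, H * W), w_loc, axis=2)   # across locations
out = out.reshape(T, C, H, W)
```

Each call to `mix` touches exactly one axis, so the shape `(T, C, H, W)` is preserved throughout. A convolutional mixer would replace the dense location weights with local kernels, which is the main way this sketch departs from the described approach.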