A central challenge of video prediction is that the system must reason about objects' future motion from image frames while keeping their appearance consistent across frames. This work introduces Motion-Matrix-based Video Prediction (MMVP), an end-to-end trainable two-stream video prediction framework, to tackle this challenge. Unlike previous methods, which usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of every pair of feature patches in the input frames and are the sole input to the motion prediction module in MMVP. This design improves both the accuracy and the efficiency of video prediction while reducing model size. Extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public datasets by non-negligible margins (about 1 dB in PSNR on UCF Sports) with significantly smaller models (84% of the size or smaller). The official code and the datasets used in this paper are available at https://github.com/Kay1794/MMVP-motion-matrix-based-video-prediction.
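To make the idea of a motion matrix concrete, the following is a minimal sketch rather than the official MMVP implementation. It assumes that "temporal similarity" is measured as cosine similarity between per-location feature vectors of two consecutive frames; the abstract does not specify the similarity measure, and the function names `motion_matrix` and `motion_matrices` are hypothetical.

```python
import numpy as np


def motion_matrix(feat_prev, feat_next, eps=1e-8):
    """Pairwise cosine similarity between the feature patches of two frames.

    feat_prev, feat_next: arrays of shape (C, H, W), one C-dim vector per patch.
    Returns an (H*W, H*W) matrix whose entry (i, j) is the similarity between
    patch i of the earlier frame and patch j of the later frame.
    Cosine similarity is an assumption made for this sketch.
    """
    c, h, w = feat_prev.shape
    a = feat_prev.reshape(c, h * w).T  # (H*W, C)
    b = feat_next.reshape(c, h * w).T  # (H*W, C)
    # Normalise each patch vector so only its direction contributes.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T


def motion_matrices(features):
    """Stack the motion matrices of all consecutive frame pairs.

    features: array of shape (T, C, H, W).
    Returns an array of shape (T-1, H*W, H*W).
    """
    return np.stack(
        [motion_matrix(features[t], features[t + 1])
         for t in range(len(features) - 1)]
    )
```

In this sketch the result records only how strongly patches in one frame match patches in the next, not the feature values themselves. That is the sense in which such a matrix can serve as a motion-only input, with appearance handled by a separate stream.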