A central challenge of video prediction is that the system must infer objects' future motion from image frames while keeping their appearance consistent across frames. This work introduces an end-to-end trainable two-stream video prediction framework, Motion-Matrix-based Video Prediction (MMVP), to tackle this challenge. Unlike previous methods, which usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of every pair of feature patches in the input frames and are the sole input to the motion prediction module in MMVP. This design improves both the accuracy and the efficiency of video prediction and reduces the model size. Extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public datasets by non-negligible margins (about 1 dB in PSNR on UCF Sports) with significantly smaller models (84% of the size or smaller).
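To make the idea of a motion matrix concrete, the following is a minimal sketch, not the MMVP implementation. It assumes that "temporal similarity" is measured as cosine similarity between the feature vectors of every pair of spatial locations in two consecutive frames; the abstract does not specify the similarity measure, and the function name `motion_matrix`, the tensor shapes, and the random features are all hypothetical, chosen for illustration only.

```python
import numpy as np


def motion_matrix(feat_a, feat_b, eps=1e-8):
    """Pairwise cosine similarity between feature patches of two frames.

    Assumption (not from the paper): similarity is cosine similarity.
    feat_a, feat_b: arrays of shape (C, H, W), one feature vector per patch.
    Returns an array of shape (H*W, H*W); entry [i, j] compares patch i of
    feat_a with patch j of feat_b.
    """
    c, h, w = feat_a.shape
    # Flatten the spatial grid so each row is one patch's feature vector.
    a = feat_a.reshape(c, h * w).T  # (H*W, C)
    b = feat_b.reshape(c, h * w).T  # (H*W, C)
    # L2-normalise each patch so the dot product becomes a cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T


# Hypothetical input: T=3 frames, C=8 channels, a 4x4 grid of feature patches.
rng = np.random.default_rng(0)
frame_feats = rng.standard_normal((3, 8, 4, 4))

# One matrix per pair of consecutive frames; in this sketch, a motion
# predictor would consume only these matrices, never the features themselves.
matrices = np.stack(
    [motion_matrix(frame_feats[t], frame_feats[t + 1]) for t in range(2)]
)
```

Because each patch vector is normalised before comparison, the result in this sketch depends only on how patches relate to one another across frames, not on the overall magnitude of their features. That is one simple way to read "appearance-agnostic", though the property in MMVP may be obtained differently.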