The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most current methods suffer from large model sizes and require extra inputs, e.g., semantic or depth maps, to achieve promising performance. For efficiency, we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) that achieves better video prediction performance than previous methods at lower computational cost, using only RGB images. The core of DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, DMVFN adaptively selects sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT in generated image quality. Our code and demo are available at https://huxiaotaostasy.github.io/DMVFN/.