This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to predicting future video frames from past ones. Individual images are decomposed into multiple layers, each combining an object mask with a small set of control points. The layer structure is shared across all frames of a video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into three steps: discovering the layers associated with past frames; predicting the corresponding transformations for upcoming frames and warping the associated object regions accordingly; and filling in the remaining image parts. Extensive experiments on multiple benchmarks, including urban videos (Cityscapes and KITTI) and videos featuring nonrigid motions (UCF-Sports and H3.6M), show that our method consistently outperforms the state of the art by a significant margin. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage: https://16lemoing.github.io/waldo.
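To make the layer-warping idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single affine transformation per layer, fitted by least squares from the displacement of that layer's control points; the abstract only says "parametric geometric transformations", so the affine choice is an assumption made for illustration. The function names `fit_affine`, `warp_layer`, and `compose` are hypothetical, and the toy 8x8 scene stands in for real video frames.

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine map sending src control points (x, y) to dst."""
    A = np.hstack([src_pts, np.ones((src_pts.shape[0], 1))])  # (K, 3)
    sol, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)         # (3, 2)
    return sol.T                                              # (2, 3)

def warp_layer(image, mask, M):
    """Backward-warp an (H, W, C) image and its (H, W) mask with affine M,
    using nearest-neighbour sampling (an illustrative simplification)."""
    H, W = mask.shape
    inv = np.linalg.inv(np.vstack([M, [0.0, 0.0, 1.0]]))
    ys, xs = np.mgrid[0:H, 0:W]
    tgt = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    src = inv @ tgt
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
    warped_img = np.zeros((H * W, image.shape[2]), dtype=image.dtype)
    warped_mask = np.zeros(H * W, dtype=mask.dtype)
    warped_img[valid] = image[sy[valid], sx[valid]]
    warped_mask[valid] = mask[sy[valid], sx[valid]]
    return warped_img.reshape(H, W, -1), warped_mask.reshape(H, W)

def compose(layers):
    """Paint (image, mask) layers back to front; also return the pixels
    covered by no layer, i.e. the regions left to be filled in."""
    H, W, C = layers[0][0].shape
    frame = np.zeros((H, W, C))
    covered = np.zeros((H, W), dtype=bool)
    for img, m in layers:
        on = m > 0.5
        frame[on] = img[on]
        covered |= on
    return frame, ~covered

# Toy scene (assumed data): a static background and one 2x2 object layer.
H, W = 8, 8
image = np.random.default_rng(0).random((H, W, 3))
obj_mask = np.zeros((H, W))
obj_mask[2:4, 1:3] = 1.0
bg_mask = 1.0 - obj_mask

# The object's control points move by (+3, +1) in (x, y) in the next frame.
src_pts = np.array([[1.0, 2.0], [3.0, 2.0], [1.0, 4.0], [3.0, 4.0]])
dst_pts = src_pts + np.array([3.0, 1.0])

M = fit_affine(src_pts, dst_pts)
obj_img, obj_m = warp_layer(image, obj_mask, M)
frame, holes = compose([(image, bg_mask), (obj_img, obj_m)])
print("pixels left to fill in:", int(holes.sum()))
```

In this sketch the background layer keeps the identity transformation, so the pixels that the object vacates are covered by no layer. These uncovered pixels correspond to the "remaining image parts" that the final step of the pipeline has to fill in.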