This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to predicting future video frames from past ones. Individual images are decomposed into multiple layers, each combining an object mask with a small set of control points. The layer structure is shared across all frames of a video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers. Video synthesis is broken down into three steps: discovering the layers associated with past frames, predicting the corresponding transformations for upcoming frames and warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on multiple benchmarks, including urban videos (Cityscapes and KITTI) and videos featuring nonrigid motions (UCF-Sports and H3.6M), show that our method outperforms the state of the art by a significant margin in every case. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage: https://16lemoing.github.io/waldo.
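The warping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each layer moves under a single 2x3 affine matrix (a simplification standing in for the control-point-based parametric transformations), uses nearest-neighbor sampling, and takes the per-layer masks and transforms as given rather than predicted by a network. The function names `warp_affine_nearest` and `composite_layers` are hypothetical, and the final filling-in step is reduced to returning a mask of uncovered pixels.

```python
import numpy as np


def warp_affine_nearest(image, matrix):
    """Warp `image` (H, W, C) by a 2x3 affine `matrix` that maps source
    (x, y, 1) to target (x, y). Uses backward mapping with nearest-neighbor
    sampling; target pixels whose source falls outside the image stay zero."""
    h, w = image.shape[:2]
    # Invert the forward transform to find the source of each target pixel.
    full = np.vstack([matrix, [0.0, 0.0, 1.0]])
    inv = np.linalg.inv(full)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = inv @ coords
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(image.shape, dtype=image.dtype)
    # reshape on a freshly allocated array returns a view, so this writes into `out`.
    out.reshape(h * w, -1)[valid] = image[sy[valid], sx[valid]]
    return out


def composite_layers(frame, masks, transforms):
    """Predict the next frame by warping each layer and compositing back to front.

    frame:      (H, W, 3) float array, the last observed frame.
    masks:      (K, H, W) soft layer masks in [0, 1] (assumed given here).
    transforms: K affine matrices of shape (2, 3), one per layer (assumed given).
    Returns the composited frame and a boolean mask of uncovered pixels that a
    separate inpainting step would have to fill.
    """
    h, w = frame.shape[:2]
    out = np.zeros_like(frame)
    covered = np.zeros((h, w))
    for mask, matrix in zip(masks, transforms):
        # Warp color and mask together so they stay aligned.
        rgba = np.concatenate([frame, mask[..., None]], axis=-1)
        warped = warp_affine_nearest(rgba, matrix)
        alpha = warped[..., 3:]
        out = alpha * warped[..., :3] + (1.0 - alpha) * out
        covered = np.maximum(covered, alpha[..., 0])
    holes = covered < 0.5
    return out, holes


if __name__ == "__main__":
    frame = np.random.rand(8, 8, 3)
    obj = np.zeros((8, 8))
    obj[2:4, 2:4] = 1.0
    masks = np.stack([1.0 - obj, obj])           # background layer, then object layer
    transforms = [
        np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),  # background stays still
        np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]),  # object shifts 2 px right
    ]
    pred, holes = composite_layers(frame, masks, transforms)
    print("uncovered pixels to fill:", int(holes.sum()))
```

In this toy setup the region vacated by the moving object is left uncovered, which is the kind of area the filling-in stage is meant to handle.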