We offer a new perspective on video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). This pipeline has three advantages. First, we can fully reuse a well-trained image generator to synthesize the static image (also called the canonical image), which eases the difficulty of producing a video and results in better visual quality. Second, a deformation field can be easily converted to optical flow, making it possible to apply explicit structural regularizations to motion modeling and leading to temporally consistent results. Third, the disentanglement between content and motion lets users process a synthesized video by processing its corresponding static image, without any tuning, which facilitates applications such as video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of GenDeF.
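To make the core idea concrete, the following is a minimal NumPy sketch of rendering frames by warping a canonical image with a per-frame deformation field. It is an illustration under stated assumptions, not the GenDeF implementation: the function names `warp_nearest` and `approx_flow` are hypothetical, the deformation is assumed to be a backward offset field of shape `(H, W, 2)` in `(dy, dx)` order, sampling is nearest-neighbour rather than differentiable, and the flow formula is a first-order approximation that is exact only for spatially constant fields.

```python
import numpy as np


def warp_nearest(canonical, deformation):
    """Backward-warp a canonical image (assumed convention).

    frame[y, x] = canonical[y + dy, x + dx], sampled with nearest-neighbour
    rounding and clamped at the image border.
    """
    h, w = canonical.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + deformation[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + deformation[..., 1]).astype(int), 0, w - 1)
    return canonical[src_y, src_x]


def approx_flow(deform_t, deform_next):
    """Approximate optical flow from frame t to frame t+1.

    Assumption: for smooth, small deformations, a canonical point seen at
    pixel p in frame t moves by roughly D_t(p) - D_{t+1}(p).
    """
    return deform_t - deform_next


# Toy example: a 4x4 canonical image and two frames.
canonical = np.arange(16).reshape(4, 4)
d0 = np.zeros((4, 4, 2))   # frame 0: identity deformation
d1 = np.zeros((4, 4, 2))
d1[..., 1] = 1.0           # frame 1: sample one pixel to the right

frame0 = warp_nearest(canonical, d0)
frame1 = warp_nearest(canonical, d1)
flow01 = approx_flow(d0, d1)  # content shifts one pixel to the left
```

Because every frame is a resampling of the same canonical image, an edit applied to `canonical` shows up in all frames after warping, which is the intuition behind the editing and tracking applications mentioned above.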