Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos, which contain rich global and local deformations. To address this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained difficult, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner. The LMDM learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics, and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks, namely infinite-length human video synthesis and content-preserving video editing.
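The warp-and-inpaint generation loop described above can be sketched in miniature: a fixed appearance (the starting frame) is deformed by a sequence of flow maps, each decoded from a motion code. The sketch below is illustrative only and not the paper's implementation; the nearest-neighbour warp, the `code_to_flow` decoder, and the toy constant-shift flows are all stand-ins (LEO's animator uses learned networks and bilinear warping with inpainting of disoccluded regions).

```python
import numpy as np

def warp(frame, flow):
    """Warp a frame by a dense flow map via nearest-neighbour sampling.

    frame: (H, W) array; flow: (H, W, 2) array of (dy, dx) offsets.
    Out-of-range samples are clipped to the border, a crude stand-in
    for inpainting disoccluded regions.
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return frame[sy, sx]

def synthesize(frame, motion_codes, code_to_flow):
    """Warp-and-inpaint loop: each motion code is decoded into a flow
    map that deforms the fixed appearance of the starting frame, so
    motion never touches the appearance representation itself."""
    return [warp(frame, code_to_flow(c)) for c in motion_codes]

# Toy demo: each code c produces a uniform rightward shift of c pixels
# (standing in for motion codes sampled from the LMDM).
rng = np.random.default_rng(0)
frame = rng.random((8, 8))
codes = [1.0, 2.0, 3.0]
to_flow = lambda c: np.stack(
    [np.zeros((8, 8)), np.full((8, 8), c)], axis=-1)
video = synthesize(frame, codes, to_flow)
print(len(video), video[0].shape)  # → 3 (8, 8)
```

Because the appearance enters only through the warped starting frame, any sequence of motion codes yields a video of the same identity, which is what makes infinite-length synthesis and content-preserving editing possible.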