Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To address this challenge, previous approaches have used different features in the generation process to represent appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, a representation that inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. The latter learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves the coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks: infinite-length human video synthesis and content-preserving video editing.
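To make the flow-map representation concrete, the following is a minimal, dependency-free sketch of the warping half of a warp-and-inpaint step. It is an illustration under stated assumptions, not LEO's implementation: the names `warp_with_flow` and `animate` are hypothetical, the flow convention (each output pixel stores the offset to the source pixel it samples from) is assumed, and nearest-neighbour sampling on nested lists stands in for whatever differentiable sampling a real animator would use. Pixels whose source falls outside the image are left as holes, which is the region an inpainting step would have to fill.

```python
def warp_with_flow(image, flow, hole=None):
    """Backward-warp a 2D image (list of rows) with a dense flow map.

    Assumed convention: flow[y][x] = (dx, dy) is the offset from output
    pixel (x, y) to the source pixel it is sampled from. Output pixels
    whose source lies outside the image are set to `hole`.
    """
    h, w = len(image), len(image[0])
    warped = [[hole] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbour lookup of the source location.
            sx, sy = round(x + dx), round(y + dy)
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = image[sy][sx]
    return warped


def animate(image, flows):
    """Produce one frame per flow map, all warped from the same image.

    Appearance comes only from `image`; motion comes only from `flows`.
    """
    return [warp_with_flow(image, flow) for flow in flows]


# Usage: a 2x3 image, one identity flow and one flow that samples from
# the pixel to the right.
image = [[1, 2, 3],
         [4, 5, 6]]
still = [[(0, 0)] * 3 for _ in range(2)]
shift = [[(1, 0)] * 3 for _ in range(2)]
frames = animate(image, [still, shift])
```

Because every frame is derived from the same starting image and differs only in its flow map, swapping the image while keeping the flow sequence changes appearance without changing motion, which is the sense in which the abstract describes the representation as isolating motion from appearance.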