In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled: 3D motions provide a structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, motivating the coupling of the two generation processes. Based on this insight, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. We then design a dual-branch diffusion model that couples the human motion and video generation processes through mutual feature interaction and 3D-2D cross attention. Moreover, we curate the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method on both 3D human motion and video generation tasks.
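To make the dual-branch coupling concrete, the sketch below shows one possible form of a block in which a motion branch and a video branch exchange information via bidirectional cross attention. This is a minimal illustration only, assuming PyTorch and standard multi-head attention; the class names, token shapes, and residual layout are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from one branch; keys/values come from the other."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, q_tokens, kv_tokens):
        q = self.norm_q(q_tokens)
        kv = self.norm_kv(kv_tokens)
        out, _ = self.attn(q, kv, kv)
        return q_tokens + out  # residual connection

class DualBranchBlock(nn.Module):
    """One coupled block: motion tokens attend to video tokens, and vice versa
    (a hypothetical stand-in for the paper's mutual feature interaction)."""
    def __init__(self, dim):
        super().__init__()
        self.motion_from_video = CrossAttention(dim)
        self.video_from_motion = CrossAttention(dim)

    def forward(self, motion_tokens, video_tokens):
        new_motion = self.motion_from_video(motion_tokens, video_tokens)
        new_video = self.video_from_motion(video_tokens, motion_tokens)
        return new_motion, new_video

# Toy usage: 16 motion tokens and 256 video tokens per sample, feature dim 64.
block = DualBranchBlock(dim=64)
motion = torch.randn(2, 16, 64)
video = torch.randn(2, 256, 64)
m, v = block(motion, video)
print(m.shape, v.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 256, 64])
```

Note that each branch keeps its own token sequence length: cross attention lets the branches exchange features at every denoising step without forcing the 3D motion and 2D video representations into a shared shape.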