In this paper, we find that the generation of 3D human motions and the generation of 2D human videos are intrinsically coupled: 3D motions provide a structural prior that keeps videos plausible and consistent, while pre-trained video models offer strong generalization capabilities for motions. Based on this observation, we present CoMoVi, a co-generative framework that generates 3D human motions and videos synchronously within a single diffusion denoising loop. Because a modality gap separates 3D human motions from 2D human-centric videos, we project the 3D human motion into a 2D human motion representation that aligns with the 2D videos. We then design a dual-branch diffusion model that couples the human motion and video generation processes through mutual feature interaction and 3D-2D cross-attention. To train and evaluate our model, we curate CoMoVi-Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate that our method generates high-quality 3D human motions with better generalization ability, and high-quality human-centric videos without external motion references.
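To make the 3D-to-2D projection step concrete, the following is a minimal sketch of the simplest possible instance: a pinhole-camera projection of 3D joint positions onto the image plane. This abstract does not specify the actual 2D motion representation used by CoMoVi, so the function `project_joints`, its parameters, and the example pose are all hypothetical and serve only to illustrate how 3D motion can be brought into the same pixel space as a video frame.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_joints(
    joints_3d: List[Point3D],
    focal_length: float,
    principal_point: Point2D,
) -> List[Point2D]:
    """Project camera-space 3D joints to pixel coordinates (pinhole model).

    Illustrative assumption only; not the representation used in the paper.
    """
    cx, cy = principal_point
    joints_2d = []
    for x, y, z in joints_3d:
        if z <= 0:
            raise ValueError("joint must lie in front of the camera (z > 0)")
        # Perspective division, then shift by the principal point.
        joints_2d.append((focal_length * x / z + cx, focal_length * y / z + cy))
    return joints_2d


# Hypothetical three-joint pose in camera coordinates (metres).
pose = [(0.0, 0.0, 2.0), (0.5, -0.5, 2.0), (-0.5, 1.0, 4.0)]
pixels = project_joints(pose, focal_length=500.0, principal_point=(320.0, 240.0))
print(pixels)
```

Applying such a projection to every frame of a motion sequence would yield a 2D trajectory per joint that shares the pixel grid of the video, which is the kind of alignment the projection step is meant to provide.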