We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g., visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks, considering image quality, identity preservation, and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit the training of a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.
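To make the two-stage design concrete, the following is a minimal, hypothetical sketch of the inference flow: audio features are first denoised into per-frame 3D motion parameters (stage 1), which then condition a temporally coupled image diffusion process that animates a single reference portrait (stage 2). All names, shapes, and the toy denoisers are illustrative assumptions, not the authors' released code or API.

```python
# Hypothetical sketch of a VLOGGER-style two-stage inference pipeline.
# Shapes, step counts, and the stub "denoisers" are assumptions for
# illustration only; a real system would use learned networks.
import numpy as np

rng = np.random.default_rng(0)

def audio_to_motion(audio_feats: np.ndarray, steps: int = 50) -> np.ndarray:
    """Stage 1: stochastic audio-to-3D-motion diffusion.

    Denoises a per-frame motion code (a toy body/face parameter vector
    per frame) conditioned on audio features. The denoiser here is a
    stand-in for a learned noise-prediction network.
    """
    T, d_motion = audio_feats.shape[0], 128          # frames, motion dims (assumed)
    x = rng.standard_normal((T, d_motion))           # start from Gaussian noise
    for t in range(steps, 0, -1):
        alpha = t / steps
        eps_hat = 0.1 * np.tanh(x + audio_feats.mean(-1, keepdims=True))  # stub denoiser
        x = x - alpha * eps_hat                      # toy reverse-diffusion update
    return x                                         # per-frame 3D pose/expression params

def motion_to_video(ref_image: np.ndarray, motion: np.ndarray,
                    steps: int = 50) -> np.ndarray:
    """Stage 2: image diffusion with spatial and temporal controls.

    Generates all frames jointly, conditioned on (a) the single reference
    image of the person and (b) the predicted 3D motion controls, so the
    full image (not just the face) is synthesized coherently over time.
    """
    T = motion.shape[0]
    H, W, C = ref_image.shape
    frames = rng.standard_normal((T, H, W, C))
    for t in range(steps, 0, -1):
        # Spatial control: condition on the reference identity image.
        # Temporal control: couple adjacent frames so motion stays smooth.
        cond = ref_image[None] + motion.mean(-1)[:, None, None, None]
        eps_hat = 0.1 * np.tanh(frames - cond)       # stub video denoiser
        frames = frames - (t / steps) * eps_hat
        frames = 0.5 * (frames + np.roll(frames, 1, axis=0))  # toy temporal coupling
    return np.clip(frames, 0.0, 1.0)

# Usage: one portrait image plus audio features drive a variable-length video.
audio = rng.standard_normal((16, 64))                # 16 frames of audio features
portrait = rng.random((64, 64, 3))                   # single input image
video = motion_to_video(portrait, audio_to_motion(audio))
print(video.shape)                                   # (16, 64, 64, 3)
```

The key design point this sketch mirrors is the separation of concerns: the first diffusion model owns the stochastic mapping from audio to motion, while the second owns photorealistic rendering, which is why no per-person training or face cropping is needed at inference time.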