Audio-driven talking head generation requires the seamless integration of audio and visual data while handling the challenges posed by diverse input portraits and the intricate correlations between audio and facial motion. In response, we propose GoHD, a robust framework designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD introduces three key modules. First, an animation module based on latent navigation improves generalization across unseen input styles; it achieves strong disentanglement of motion and identity, and it incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Second, a conformer-structured conditional diffusion model is designed to generate prosody-aware head poses. Third, to estimate lip-synchronized and realistic expressions from input audio with limited training data, a two-stage training strategy decouples frequent, frame-wise lip motion distillation from the generation of other motions that are more temporally dependent but less audio-related, e.g., blinks and frowns. Extensive experiments validate GoHD's strong generalization, demonstrating its effectiveness in generating realistic talking-face results for arbitrary subjects.
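To make the second module concrete, the sketch below shows one plausible shape of a conformer-structured conditional denoiser for head-pose diffusion: a noisy pose sequence is concatenated with frame-aligned audio (prosody) features, a timestep embedding is added, and stacked conformer blocks (Macaron-style half-step feed-forwards around self-attention and a depthwise convolution) predict the injected noise. All dimensions, layer counts, and names (`PoseDenoiser`, `pose_dim`, `audio_dim`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> depthwise conv -> half-step FFN."""
    def __init__(self, d, heads=4, kernel=7):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                 nn.SiLU(), nn.Linear(4 * d, d))
        self.att_norm = nn.LayerNorm(d)
        self.att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.dwconv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.pwconv = nn.Conv1d(d, d, 1)
        self.ff2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                 nn.SiLU(), nn.Linear(4 * d, d))
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                      # x: (B, T, d)
        x = x + 0.5 * self.ff1(x)
        h = self.att_norm(x)
        x = x + self.att(h, h, h, need_weights=False)[0]
        h = self.conv_norm(x).transpose(1, 2)  # (B, d, T) for Conv1d
        x = x + self.pwconv(torch.nn.functional.silu(self.dwconv(h))).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)

class PoseDenoiser(nn.Module):
    """Predicts the noise added to a head-pose sequence, conditioned on
    per-frame audio features and the diffusion timestep (illustrative)."""
    def __init__(self, pose_dim=6, audio_dim=80, d=128, n_blocks=2):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + audio_dim, d)
        self.t_emb = nn.Sequential(nn.Linear(1, d), nn.SiLU(), nn.Linear(d, d))
        self.blocks = nn.ModuleList(ConformerBlock(d) for _ in range(n_blocks))
        self.out = nn.Linear(d, pose_dim)

    def forward(self, noisy_pose, audio, t):
        # noisy_pose: (B, T, pose_dim); audio: (B, T, audio_dim); t: (B,)
        x = self.in_proj(torch.cat([noisy_pose, audio], dim=-1))
        x = x + self.t_emb(t.float().view(-1, 1))[:, None, :]  # broadcast over T
        for blk in self.blocks:
            x = blk(x)
        return self.out(x)                     # predicted noise, (B, T, pose_dim)
```

At sampling time such a denoiser would be run inside a standard DDPM/DDIM loop; conditioning on audio at every step is what makes the resulting head-pose trajectory prosody-aware.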