Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we present the first application of a pretrained transformer-based video generative model to portrait animation; the model demonstrates strong generalization and produces highly dynamic, realistic videos, effectively addressing these challenges. Adopting this new video backbone renders previous U-Net-based techniques for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network, consisting of a causal 3D VAE combined with a stacked series of transformer layers, that ensures consistent facial identity across video sequences. We also investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark datasets and newly proposed in-the-wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits with diverse orientations in dynamic, immersive scenes. Further visualizations and the source code are available at: https://github.com/fudan-generative-vision/hallo3.
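
To make the identity reference network concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a causal 3D convolutional encoder stands in for the causal 3D VAE, and a stack of transformer layers turns the resulting latents into identity tokens that a video backbone could attend to. All class names, layer sizes, and the `IdentityReferenceNet` module itself are illustrative assumptions.

```python
# Hypothetical sketch of an identity reference branch:
# causal-3D-VAE-style encoder + stacked transformer layers.
import torch
import torch.nn as nn


class CausalConv3d(nn.Module):
    """3D conv that pads only past frames along time (causal in t)."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.pad_t = k - 1
        self.conv = nn.Conv3d(c_in, c_out, k, padding=(0, k // 2, k // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        # Pad only at the start of the time axis so no future frames leak in.
        x = nn.functional.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)


class IdentityReferenceNet(nn.Module):
    """Assumed structure: encoder to latents, then transformer layers."""

    def __init__(self, latent_dim=16, model_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the causal 3D VAE encoder
            CausalConv3d(3, 64), nn.SiLU(),
            CausalConv3d(64, latent_dim),
        )
        self.proj = nn.Linear(latent_dim, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ref):  # ref: (B, 3, T, H, W)
        z = self.encoder(ref)                        # (B, latent, T, H, W)
        tokens = z.flatten(2).transpose(1, 2)        # (B, T*H*W, latent)
        return self.transformer(self.proj(tokens))  # identity tokens


ref = torch.randn(1, 3, 1, 32, 32)       # a single reference portrait frame
id_tokens = IdentityReferenceNet()(ref)  # (1, 32*32, 512)
print(id_tokens.shape)
```

In such a design, the identity tokens would typically be injected into the generative backbone via cross-attention at every denoising step, so the generated face stays consistent with the reference across the sequence.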
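The abstract also mentions speech audio conditioning and motion frame mechanisms for continuous generation. Below is an equally hedged sketch of one plausible reading: audio features (e.g., wav2vec-style embeddings) injected through cross-attention, and the tail latents of the previous clip prepended as motion context so each new clip continues the last. The names `AudioCrossAttention`, `with_motion_frames`, and all shapes are assumptions for illustration only.

```python
# Hypothetical sketch of two conditioning paths: audio cross-attention
# and motion-frame context for clip-to-clip continuity.
import torch
import torch.nn as nn


class AudioCrossAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_feats):
        # Video tokens query the audio feature sequence.
        out, _ = self.attn(self.norm(video_tokens), audio_feats, audio_feats)
        return video_tokens + out  # residual injection of speech information


def with_motion_frames(prev_latents, new_latents, n_motion=2):
    """Prepend the tail of the previous clip's latents as motion context."""
    context = prev_latents[:, -n_motion:]           # (B, n_motion, N, D)
    return torch.cat([context, new_latents], dim=1)


B, T, N, D = 1, 8, 256, 512
video = torch.randn(B, T * N, D)   # flattened video tokens
audio = torch.randn(B, 50, D)      # 50 audio feature steps
video = AudioCrossAttention()(video, audio)

prev = torch.randn(B, T, N, D)
nxt = torch.randn(B, T, N, D)
print(with_motion_frames(prev, nxt).shape)  # (1, 10, 256, 512)
```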