Existing methods for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we present the first application of a pretrained transformer-based video generative model to portrait animation; the model demonstrates strong generalization and produces highly dynamic, realistic videos, effectively addressing these challenges. Adopting this new video backbone renders previous U-Net-based techniques for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network, consisting of a causal 3D VAE combined with a stack of transformer layers, that ensures consistent facial identity across video sequences. We further investigate several speech audio conditioning and motion frame mechanisms that enable the generation of continuous, speech-driven video. Experiments on benchmark datasets and a newly proposed wild dataset validate our method, demonstrating substantial improvements over prior approaches in generating realistic portraits with diverse orientations in dynamic, immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.
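To make the identity reference network concrete, the following is a minimal PyTorch sketch of the design named above: a causal 3D VAE-style encoder followed by stacked transformer layers that turn a reference portrait into identity tokens. All module names (`CausalConv3d`, `IdentityReferenceNet`), dimensions, and layer counts are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an identity reference network: a causal 3D VAE-style
# encoder plus stacked transformer layers. Names and sizes are assumptions
# for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution padded only with past frames along time (causal)."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.pad_t = kernel - 1  # temporal left-padding only
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))  # pad past frames
        return self.conv(x)

class IdentityReferenceNet(nn.Module):
    """Encodes a reference portrait with a causal 3D encoder, then refines
    the latents with stacked transformer layers into identity tokens."""
    def __init__(self, latent_dim=256, n_layers=4, n_heads=8):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the causal 3D VAE encoder
            CausalConv3d(3, 64), nn.SiLU(),
            CausalConv3d(64, latent_dim), nn.SiLU(),
        )
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ref):  # ref: (B, 3, T, H, W); T=1 for a single portrait
        z = self.encoder(ref)                  # (B, D, T, H, W) latents
        tokens = z.flatten(2).transpose(1, 2)  # (B, T*H*W, D) identity tokens
        return self.transformer(tokens)        # features for cross-attention

# Usage: the resulting identity tokens would condition the video backbone
# (e.g., via cross-attention) to keep facial identity consistent over time.
net = IdentityReferenceNet()
ref_image = torch.randn(1, 3, 1, 32, 32)  # one reference frame
id_tokens = net(ref_image)
print(id_tokens.shape)  # torch.Size([1, 1024, 256])
```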