Recent advances in latent diffusion-based generative models for portrait image animation, such as Hallo, have achieved impressive results in short-duration video synthesis. In this paper, we present updates to Hallo, introducing several design enhancements to extend its capabilities. First, we extend the method to produce long-duration videos. To address substantial challenges such as appearance drift and temporal artifacts, we investigate augmentation strategies within the image space of conditional motion frames. Specifically, we introduce a patch-drop technique augmented with Gaussian noise to enhance visual consistency and temporal coherence over long durations. Second, we achieve 4K-resolution portrait video generation. To accomplish this, we implement vector quantization of latent codes and apply temporal alignment techniques to maintain coherence across the temporal dimension. By integrating a high-quality decoder, we realize visual synthesis at 4K resolution. Third, we incorporate adjustable semantic textual labels for portrait expressions as conditional inputs. This extends beyond traditional audio cues to improve controllability and increase the diversity of the generated content. To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts. We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced "Wild" dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for durations extending up to tens of minutes. Project page: https://fudan-generative-vision.github.io/hallo2
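The patch-drop augmentation described above can be illustrated with a minimal sketch: square patches of a conditional motion frame are randomly replaced with Gaussian noise, forcing the model to rely less on fine appearance details of previous frames. The function name, patch size, drop probability, and noise scale below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def patch_drop_augment(frame, patch_size=32, drop_prob=0.25, noise_std=0.1, rng=None):
    """Randomly replace square patches of a conditioning frame with Gaussian noise.

    frame: (H, W, C) float array (a conditional motion frame).
    patch_size, drop_prob, noise_std: illustrative values, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = frame.copy()
    h, w = frame.shape[:2]
    # Walk the frame in a patch grid; drop each patch independently.
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            if rng.random() < drop_prob:
                block = out[y:y + patch_size, x:x + patch_size]
                # Replace the dropped patch with zero-mean Gaussian noise.
                out[y:y + patch_size, x:x + patch_size] = rng.normal(
                    0.0, noise_std, block.shape)
    return out
```

Applied to each conditional motion frame during training, this kind of corruption degrades the appearance information the model can copy while leaving the overall motion layout intact, which is the intuition behind using it to curb appearance drift over long durations.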