Audio-driven portrait animation aims to synthesize realistic and natural talking head videos from an input audio signal and a single reference image. While existing methods achieve high-quality results by leveraging high-dimensional intermediate representations and explicitly modeling motion dynamics, their computational complexity renders them unsuitable for real-time deployment. Real-time inference imposes stringent latency and memory constraints, often necessitating the use of highly compressed latent representations. However, operating in such compact spaces hinders the preservation of fine-grained spatiotemporal details, thereby complicating audio-visual synchronization. In this paper, we propose RAP (Real-time Audio-driven Portrait animation), a unified framework for generating high-quality talking portraits under real-time constraints. Specifically, RAP introduces a hybrid attention mechanism for fine-grained audio control, and a static-dynamic training-inference paradigm that avoids explicit motion supervision. Through these techniques, RAP achieves precise audio-driven control, mitigates long-term temporal drift, and maintains high visual fidelity. Extensive experiments demonstrate that RAP achieves state-of-the-art performance while operating under real-time constraints.
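The abstract names a hybrid attention mechanism for fine-grained audio control but does not specify its architecture. The sketch below illustrates one plausible instantiation under assumed details: self-attention over compressed visual latent tokens combined with cross-attention in which those tokens query audio frame features. The class name `HybridAttentionBlock`, the dimensions, and the fusion order are all hypothetical, not the paper's published design.

```python
# Hypothetical sketch of a hybrid attention block. The paper does not detail
# its mechanism here; dimensions, names, and the self-then-cross fusion order
# are assumptions for illustration only.
import torch
import torch.nn as nn


class HybridAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, audio_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Project audio features into the visual latent dimension.
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, dim) compressed visual tokens
        # audio:   (B, T, audio_dim) per-frame audio features
        x = self.norm1(latents)
        latents = latents + self.self_attn(x, x, x, need_weights=False)[0]
        a = self.audio_proj(audio)
        x = self.norm2(latents)
        # Cross-attention: visual tokens attend to audio frames,
        # the assumed path for fine-grained audio-visual synchronization.
        latents = latents + self.cross_attn(x, a, a, need_weights=False)[0]
        return latents


# Usage sketch with arbitrary shapes.
block = HybridAttentionBlock()
out = block(torch.randn(2, 64, 512), torch.randn(2, 100, 768))
print(out.shape)  # torch.Size([2, 64, 512])
```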