Existing audio-driven facial animation methods face critical challenges, including expression leakage, ineffective transfer of subtle expressions, and imprecise audio-visual synchronization. We find that these issues stem from limitations in motion representation and a lack of fine-grained control over facial expressions. To address these problems, we present Takin-ADA, a novel two-stage approach for real-time audio-driven portrait animation. In the first stage, we introduce a specialized loss function that enhances the transfer of subtle expressions while reducing unwanted expression leakage. The second stage employs an advanced audio processing technique to improve lip-sync accuracy. Our method not only generates precise lip movements but also allows flexible control over facial expressions and head motions. Takin-ADA achieves high-resolution (512×512) facial animation at up to 42 FPS on an RTX 4090 GPU, outperforming existing commercial solutions. Extensive experiments demonstrate that our model significantly surpasses previous methods in video quality, realism of facial dynamics, and naturalness of head movements, setting a new benchmark in the field of audio-driven facial animation.