Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lip-sync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independently of character identity. Finally, a generator trained in the first stage takes the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications of portrait animation. The code will be available at: https://jdhalgo.github.io/JoyVASA.
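The two-stage pipeline described above can be sketched as follows. Everything here is an illustrative assumption for exposition, not the authors' actual API: the function names, feature dimensions, frame rate, and the toy "denoising" loop are placeholders that only mirror the data flow (reference portrait → static representation; audio → identity-independent motion sequence; both → rendered frames).

```python
import numpy as np

# Hypothetical sketch of the two-stage inference flow; all names and
# dimensions are assumptions, not the JoyVASA implementation.
STATIC_DIM = 64   # assumed size of the static 3D facial representation
MOTION_DIM = 70   # assumed per-frame motion code (expression + head pose)
FPS = 25          # assumed output frame rate
SAMPLE_RATE = 16000  # assumed audio sample rate (Hz)

def encode_reference(image: np.ndarray) -> np.ndarray:
    """Stage 1 encoder (placeholder): map a reference portrait to an
    identity-specific static representation of fixed size."""
    flat = image.astype(np.float64).ravel()
    return np.resize(flat, STATIC_DIM)  # toy projection to STATIC_DIM

def generate_motion(audio: np.ndarray, num_steps: int = 10) -> np.ndarray:
    """Stage 2 (placeholder): a diffusion-style loop that produces a motion
    sequence conditioned only on audio -- note it never sees the identity."""
    num_frames = max(1, len(audio) * FPS // SAMPLE_RATE)
    rng = np.random.default_rng(0)
    motion = rng.standard_normal((num_frames, MOTION_DIM))  # start from noise
    target = np.resize(audio.astype(np.float64), (num_frames, MOTION_DIM))
    for _ in range(num_steps):
        # Toy "denoising": nudge the sample toward an audio-derived target.
        motion += 0.3 * (target - motion)
    return motion

def render(static_rep: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 1 generator (placeholder): combine the static representation
    with each motion frame; any static_rep works with any motion sequence,
    which is what enables arbitrarily long videos and animal faces."""
    return static_rep[None, :STATIC_DIM] + motion[:, :STATIC_DIM]

if __name__ == "__main__":
    image = np.zeros((4, 4, 3))  # stand-in reference portrait
    audio = np.ones(32000)       # stand-in 2 s of 16 kHz audio
    frames = render(encode_reference(image), generate_motion(audio))
    print(frames.shape)          # one feature row per output video frame
```

Because `generate_motion` is conditioned only on audio, the same generated motion sequence can be rendered with any static representation, which is the decoupling the abstract relies on for longer videos and non-human faces.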