We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed to make social interactions in virtual reality accessible to anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded into photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs, and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios, processing continuous audio signals frame by frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors such as head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, with 100 to 1,000 times faster inference. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speech.
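The online-transformer property described above, namely that the animation for the current frame never depends on future audio, can be illustrated with a minimal numpy sketch of causally masked self-attention. This is not the paper's implementation; the function name, feature shapes, and single-head formulation are illustrative assumptions.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention over per-frame audio features x of shape
    (T, d), with a causal mask so frame t attends only to frames <= t.
    Hypothetical sketch; real models use learned projections and multiple heads."""
    T, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)                  # (T, T) attention logits
    mask = np.tril(np.ones((T, T), dtype=bool))      # causal mask: block future frames
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ x                               # (T, d) attended features

# Streaming property: changing future audio leaves earlier outputs untouched,
# so each frame can be emitted as soon as its audio arrives.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 4))
out_a = causal_self_attention(audio)
audio_b = audio.copy()
audio_b[5:] = rng.normal(size=(3, 4))                # perturb only future frames
out_b = causal_self_attention(audio_b)
assert np.allclose(out_a[:5], out_b[:5])             # frames 0..4 are unchanged
```

This causality is what allows frame-by-frame processing of a live audio stream without buffering future input, which is the source of the system's low latency.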