The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle to capture the complex interaction between audio and facial dynamics, leading to issues with lip synchronization and visual quality. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual-encoder structure, DAMC captures semantic content through a Content-Aware Encoder and ensures precise visual synchronization through a Dynamic-Sync Encoder. These features are fused by a Cross-Synchronized Fusion Module (CSFM), enhancing both content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches on key metrics such as lip synchronization accuracy and image quality, and generalizes robustly across diverse audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results offer a high-quality, scalable solution for audio-driven talking head generation.
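The abstract describes two audio encoders whose outputs are fused by the CSFM before driving the NeRF renderer. The abstract does not specify the fusion mechanism, so the following is only a minimal sketch under the assumption that CSFM performs a cross-attention-style fusion with a residual connection; all function names, feature shapes, and the attention form are illustrative, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_sync_fusion(content, dynamic):
    """Hypothetical CSFM sketch: content features attend to dynamic
    features, and the attended result is added back residually.

    content: (T, D) output of the Content-Aware Encoder (assumed shape)
    dynamic: (T, D) output of the Dynamic-Sync Encoder (assumed shape)
    """
    d = content.shape[-1]
    # Scaled dot-product attention weights: each content frame
    # attends over all dynamic frames.
    attn = softmax(content @ dynamic.T / np.sqrt(d))  # (T, T)
    # Residual fusion keeps the semantic content while injecting
    # synchronization cues from the dynamic branch.
    return content + attn @ dynamic  # (T, D)

rng = np.random.default_rng(0)
content = rng.standard_normal((10, 64))  # 10 frames, 64-dim features (assumed)
dynamic = rng.standard_normal((10, 64))
fused = cross_sync_fusion(content, dynamic)
print(fused.shape)  # (10, 64)
```

The fused features would then condition the NeRF's radiance field per frame; that stage is omitted here since the abstract gives no detail about it.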