Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios and lack natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce MANGO, a novel two-stage framework that leverages pure image-level supervision through alternating training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behavior. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, a fast 3D Gaussian renderer generates high-fidelity images and provides 2D photometric supervision for the 3D motions through alternating training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data spanning more than 500 identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.
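As a rough illustration of the first stage's dual-audio interaction, the sketch below is a minimal, hypothetical PyTorch module; the class name, cross-attention design, and dimensions are our assumptions for illustration, not the paper's implementation. It fuses the avatar's own audio stream with the interlocutor's via cross-attention to produce a conditioning signal for a diffusion transformer.

```python
# Hypothetical sketch of a dual-audio interaction module: cross-attention
# that lets the avatar's own audio features attend to the conversational
# partner's, yielding a fused conditioning signal for a diffusion transformer.
# All names and dimensions are illustrative, not from the paper.
import torch
import torch.nn as nn

class DualAudioInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_norm = nn.LayerNorm(dim)
        self.cross_norm = nn.LayerNorm(dim)
        # Queries come from the avatar's audio; keys/values from the partner's.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, own_audio: torch.Tensor, partner_audio: torch.Tensor) -> torch.Tensor:
        # own_audio, partner_audio: (batch, frames, dim) encoded audio features.
        q = self.self_norm(own_audio)
        kv = self.cross_norm(partner_audio)
        fused, _ = self.cross_attn(q, kv, kv)
        x = own_audio + fused        # residual fusion of the two streams
        return x + self.ff(x)        # conditioning signal for the transformer

# Smoke test with random features standing in for encoded audio.
if __name__ == "__main__":
    m = DualAudioInteraction()
    own = torch.randn(2, 100, 256)
    partner = torch.randn(2, 100, 256)
    print(m(own, partner).shape)  # torch.Size([2, 100, 256])
```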
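The second stage's alternating training can likewise be sketched as below, assuming a generic setup: one phase regresses the motion model toward (noisy) pseudo-3D labels, and the alternate phase renders the predicted motion and applies a 2D photometric loss so image-level supervision corrects label noise. `MotionModel`, `GaussianRenderer`, the losses, and all shapes are hypothetical stand-ins, not the authors' code.

```python
# A minimal sketch of alternating training with image-level supervision.
# Both networks below are toy stand-ins; a real renderer would be a
# differentiable 3D Gaussian splatting pipeline.
import torch
import torch.nn as nn

class MotionModel(nn.Module):
    """Stand-in for the audio-conditioned 3D motion network."""
    def __init__(self, audio_dim=256, motion_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.GELU(), nn.Linear(256, motion_dim)
        )
    def forward(self, audio):
        return self.net(audio)

class GaussianRenderer(nn.Module):
    """Stand-in for a differentiable renderer mapping motion to pixels."""
    def __init__(self, motion_dim=64, pixels=32 * 32 * 3):
        super().__init__()
        self.net = nn.Linear(motion_dim, pixels)
    def forward(self, motion):
        return self.net(motion)

motion_model, renderer = MotionModel(), GaussianRenderer()
opt = torch.optim.Adam(
    list(motion_model.parameters()) + list(renderer.parameters()), lr=1e-4
)

for step in range(4):
    audio = torch.randn(8, 256)                # dummy encoded audio features
    pseudo_3d = torch.randn(8, 64)             # dummy noisy pseudo-3D labels
    target_img = torch.randn(8, 32 * 32 * 3)   # dummy ground-truth frames

    motion = motion_model(audio)
    if step % 2 == 0:
        # Phase A: fit the motion model to pseudo-3D labels.
        loss = nn.functional.mse_loss(motion, pseudo_3d)
    else:
        # Phase B: render and apply a photometric (image-level) loss, letting
        # gradients flow back through the renderer into the motion model.
        loss = nn.functional.mse_loss(renderer(motion), target_img)
    opt.zero_grad()
    loss.backward()
    opt.step()
```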