DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies treat facial animation as a single regression problem, an approach that often fails to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlooks their inherent consistency. Moreover, because 3D audio-visual datasets are scarce, approaches trained on small samples generalize poorly, which degrades performance. To address these issues, we propose a cross-modal dual-learning framework, termed DualTalker, that aims to improve data usage efficiency and to model cross-modal dependencies. The framework jointly trains the primary task (audio-driven facial animation) and its dual task (lip reading), which share common audio and motion encoder components. This joint training uses data more efficiently by leveraging information from both tasks, and it explicitly exploits the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate potential over-smoothing in the cross-modal complementary representations, which improves the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Our code and video demonstrations are available at https://github.com/sabrina-su/iadf.git.
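To make the joint objective described above more concrete, the following is a minimal sketch of how a primary animation loss, a dual lip-reading loss, and an auxiliary cross-modal consistency loss might be combined into a single training objective. The abstract does not specify the loss forms or weights, so everything here is an assumption made for illustration: mean squared error for the animation term, cross-entropy for the lip-reading term, mean squared error between audio and motion features for the consistency term, and the hypothetical names `joint_loss`, `w_dual`, and `w_consist`.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of one label under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def joint_loss(
    pred_motion: Sequence[float],
    true_motion: Sequence[float],
    text_logits: Sequence[float],
    true_token: int,
    audio_feat: Sequence[float],
    motion_feat: Sequence[float],
    w_dual: float = 1.0,      # hypothetical weight for the dual (lip-reading) task
    w_consist: float = 1.0,   # hypothetical weight for the consistency term
) -> float:
    """Weighted sum of the three terms; the loss forms are assumptions."""
    # Primary task: audio -> facial motion (assumed regression loss).
    loss_anim = mse(pred_motion, true_motion)
    # Dual task: facial motion -> text token (assumed classification loss).
    loss_lip = cross_entropy(text_logits, true_token)
    # Auxiliary term: pull audio and motion features together (assumed MSE).
    loss_consist = mse(audio_feat, motion_feat)
    return loss_anim + w_dual * loss_lip + w_consist * loss_consist


if __name__ == "__main__":
    total = joint_loss(
        pred_motion=[0.1, 0.4], true_motion=[0.0, 0.5],
        text_logits=[2.0, 0.5, -1.0], true_token=0,
        audio_feat=[0.3, 0.7], motion_feat=[0.2, 0.9],
    )
    print(f"joint loss: {total:.4f}")
```

In a real system the three terms would be computed on batched tensors produced by the shared encoders and minimized together by one optimizer; the scalar version above only illustrates how the terms could be weighted and summed.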