Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Code and datasets are available at https://multi-talk.github.io/.
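To make the language-specific style embedding idea concrete, below is a minimal PyTorch sketch under assumed details: a learned embedding table indexed by a language ID, whose vector conditions per-frame audio features before they are decoded into facial motion. The class name, dimensions, and the additive conditioning scheme are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LanguageStyleEmbedding(nn.Module):
    """Hypothetical sketch: one learned style vector per language,
    used to condition the audio-to-motion features. Names and the
    additive conditioning below are assumptions for illustration."""

    def __init__(self, num_languages: int = 20, dim: int = 64):
        super().__init__()
        # One trainable style vector per language in the dataset.
        self.table = nn.Embedding(num_languages, dim)

    def forward(self, language_id: torch.Tensor) -> torch.Tensor:
        # language_id: (batch,) integer IDs -> (batch, dim) style vectors
        return self.table(language_id)

# Usage: condition per-frame audio features on the language style.
emb = LanguageStyleEmbedding()
audio_feats = torch.randn(1, 100, 64)          # (batch, frames, dim), dummy features
style = emb(torch.tensor([3]))                 # style vector for language ID 3
conditioned = audio_feats + style.unsqueeze(1) # broadcast the style over all frames
print(conditioned.shape)                       # torch.Size([1, 100, 64])
```

In a setup like this, the embedding table is trained jointly with the rest of the model, so each language's vector can absorb the systematic mouth-shape differences the abstract describes.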