We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512×512 pixels while maintaining a low bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation. Synthetic results can be viewed at https://x-lance.github.io/VQTalker.
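The abstract does not spell out how GRFSQ works, but the name combines three known quantization ideas: finite scalar quantization (each channel snapped to a small fixed grid), residual quantization (later stages quantize what earlier stages missed), and channel grouping. The sketch below is an illustrative NumPy reconstruction of that combination under those assumptions; the function names, group/stage counts, and level count are hypothetical and not taken from the paper.

```python
import numpy as np

def fsq_quantize(z, levels=5):
    """Finite Scalar Quantization (sketch): bound each channel to
    (-1, 1) with tanh, then round to one of `levels` evenly spaced
    grid values. No learned codebook is needed."""
    half = (levels - 1) / 2.0
    bounded = np.tanh(z)
    return np.round(bounded * half) / half

def grfsq_encode(z, num_groups=2, num_stages=2, levels=5):
    """Group Residual FSQ (sketch, hypothetical parameterization):
    split channels into groups; within each group, apply FSQ in
    stages, each stage quantizing the residual left by the last."""
    groups = np.split(z, num_groups, axis=-1)
    out = []
    for g in groups:
        quantized = np.zeros_like(g)
        residual = g
        for _ in range(num_stages):
            q = fsq_quantize(residual, levels)
            quantized += q           # accumulate coarse-to-fine codes
            residual = residual - q  # pass the leftover to next stage
        out.append(quantized)
    return np.concatenate(out, axis=-1)
```

Because each channel carries only log2(levels) bits per stage, the implied bitrate of the token stream stays small, which is consistent with the low-bitrate claim in the abstract.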