The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which efficiently generates high-quality motion codes. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield consistent speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
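To make the motion sampler concrete, below is a minimal sketch of conditional flow matching for generating motion codes, assuming motion codes are fixed-length vectors and the condition is a single identity/text embedding; the class and function names (ConditionalVelocityField, cfm_loss, sample_motion) and all dimensions are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class ConditionalVelocityField(nn.Module):
    """Small MLP predicting the flow velocity v(x_t, t | cond).
    Architecture and dimensions are illustrative only."""
    def __init__(self, motion_dim=128, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # t has shape (batch, 1); concatenate state, time, and condition.
        return self.net(torch.cat([x_t, t, cond], dim=-1))


def cfm_loss(model, x1, cond):
    """Conditional flow matching objective with a linear probability path:
    x_t = (1 - t) * x0 + t * x1, target velocity = x1 - x0."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.size(0), 1, device=x1.device)  # random time in [0, 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()


@torch.no_grad()
def sample_motion(model, cond, motion_dim=128, steps=10):
    """Generate a motion code by Euler integration of the learned ODE,
    starting from Gaussian noise; a small number of steps keeps sampling cheap."""
    x = torch.randn(cond.size(0), motion_dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i * dt, device=cond.device)
        x = x + dt * model(x, t, cond)
    return x
```

The sketch illustrates why flow matching suits this setting: training reduces to a simple regression on velocities, and sampling needs only a few ODE steps, which is where the efficiency claim for the motion sampler comes from.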