Audio-visual alignment after dubbing is a challenging research problem. To address it, we propose DubWise, a novel multi-modal large language model (LLM)-based text-to-speech (TTS) method that controls the duration of synthesized speech so that it aligns well with the speaker's lip movements in a reference video, even when the spoken text is different or in a different language. To accomplish this, we apply cross-modal attention in a pre-trained GPT-based TTS model. We combine linguistic tokens from text, speaker identity tokens from a voice cloning network, and video tokens from a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. The proposed method achieves improved lip sync and naturalness compared to state-of-the-art methods in both the same-language, different-text (i.e., non-parallel) scenario and the different-language, different-text (i.e., cross-lingual) scenario.
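To make the cross-modal attention step concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention, in which one token stream attends over another. It is an illustration under stated assumptions, not the DubWise implementation: the abstract does not say which stream acts as the query, so the choice of text-side states as queries and video tokens as keys and values is an assumption. The names `cross_attention`, `text_states`, and `video_tokens` are hypothetical, the token values are random placeholders, and learned projection matrices are omitted for brevity.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n_q vectors of dimension d (assumed here: text-side states)
    keys, values: n_kv vectors (assumed here: video tokens)
    Returns the attended outputs and the attention weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_tokens(n, d, rng):
    """Placeholder token embeddings; real tokens would come from encoders."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
text_states = random_tokens(5, 8, rng)    # hypothetical: 5 text-side tokens
video_tokens = random_tokens(12, 8, rng)  # hypothetical: 12 video tokens

fused, attn = cross_attention(text_states, video_tokens, video_tokens)
```

Each row of `attn` is a probability distribution over the video tokens, so every text-side position receives a video-conditioned summary vector in `fused`. In a trained model, this is the kind of pathway through which visual timing cues could influence the generated speech duration.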