Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown unprecedented capabilities in multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model to video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video pair while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos in which the language switches within a single clip, and then inpaint the face and audio of each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
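To make the self-supervised data-synthesis step described above concrete, the following Python sketch outlines how paired training clips could be assembled: sample a bilingual clip with a mid-clip language switch, then inpaint the face and audio of each half to match the other half's language. All names here (Clip, generate_bilingual_clip, inpaint_face_and_audio) are hypothetical placeholders for the audio-visual diffusion model's sampling and inpainting calls, not the actual implementation.

```python
# Hedged sketch of the paired-data synthesis for training the dubbing LoRA.
# The two model callables are assumptions standing in for the audio-visual
# diffusion model's generation and face/audio inpainting interfaces.
from dataclasses import dataclass
from typing import Callable, Tuple

import numpy as np


@dataclass
class Clip:
    video: np.ndarray  # (T, H, W, 3) RGB frames
    audio: np.ndarray  # (S,) mono waveform
    language: str


def make_dubbing_pairs(
    generate_bilingual_clip: Callable[[str, str], Tuple[Clip, Clip]],
    inpaint_face_and_audio: Callable[[Clip, str], Clip],
    lang_a: str = "en",
    lang_b: str = "es",
) -> Tuple[Tuple[Clip, Clip], Tuple[Clip, Clip]]:
    """Build (original, dubbed) training pairs for one synthetic speaker.

    1. Sample a clip in which the same speaker switches language mid-clip,
       giving half_a spoken in lang_a and half_b spoken in lang_b.
    2. Inpaint the face and audio of each half so it matches the language
       of the other half, keeping identity, background, and body motion.
    """
    half_a, half_b = generate_bilingual_clip(lang_a, lang_b)

    # Re-render half_a speaking lang_b, and half_b speaking lang_a.
    half_a_dubbed = inpaint_face_and_audio(half_a, lang_b)
    half_b_dubbed = inpaint_face_and_audio(half_b, lang_a)

    # Each pair shows the same speaker and scene in two languages,
    # which is the supervision signal for the video-to-video dubbing LoRA.
    return (half_a, half_a_dubbed), (half_b, half_b_dubbed)
```

In this sketch the pairing works in both directions, so a single bilingual clip yields two supervised examples; the exact generation and inpainting procedure is whatever the underlying audio-visual diffusion model provides.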