Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio clip while synchronizing with the lip movements in a target video. Existing methods rely on explicit alignment at the duration level and, as a result, fail to achieve precise lip-sync and lack naturalness. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, which degrades timbre and pronunciation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory through acoustic style adaptation, fine-grained visual calibration, and time-aware context alignment. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism, which simultaneously constrains frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves state-of-the-art performance across multiple metrics.