Video dubbing requires content accuracy, expressive prosody, high-quality acoustics, and precise lip synchronization, yet existing approaches struggle on all four fronts. To address these issues, we propose DiFlowDubber, the first video dubbing framework built on a discrete flow matching backbone and trained with a novel two-stage strategy. In the first stage, a zero-shot text-to-speech (TTS) system is pre-trained on large-scale corpora: a deterministic architecture captures linguistic structure, while the Discrete Flow-based Prosody-Acoustic (DFPA) module models expressive prosody and realistic acoustic characteristics. In the second stage, we propose Content-Consistent Temporal Adaptation (CCTA) to transfer TTS knowledge to the dubbing domain; its Synchronizer enforces cross-modal alignment for lip-synchronized speech. Complementarily, the Face-to-Prosody Mapper (FaPro) conditions prosody on facial expressions. The FaPro outputs are then fused with those of the Synchronizer to construct rich, fine-grained multimodal embeddings that capture prosody-content correlations and guide the DFPA module to generate expressive prosody and acoustic tokens for content-consistent speech. Experiments on two benchmark datasets demonstrate that DiFlowDubber outperforms prior methods across multiple evaluation metrics.
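To make the generation step concrete, the following is a minimal toy sketch of how fused conditioning embeddings could drive discrete token sampling. It is not the DiFlowDubber implementation, and everything in it is an assumption made for illustration: the `fuse` function (element-wise addition of Synchronizer-like and FaPro-like embeddings), the `toy_posterior` stand-in for a learned denoiser, the masked source state `MASK`, and the linear scheduler under which a still-masked token is revealed in step `s` of `n` with probability `1 / (n - s)`.

```python
import math
import random

MASK = -1  # hypothetical id for the masked source state (assumption)


def fuse(sync_emb, prosody_emb):
    """Hypothetical fusion: element-wise sum of per-frame embeddings."""
    return [[s + p for s, p in zip(sf, pf)]
            for sf, pf in zip(sync_emb, prosody_emb)]


def toy_posterior(frame):
    """Stand-in for a learned denoiser: softmax over one conditioning frame.

    The frame dimension doubles as the token vocabulary size in this toy.
    """
    m = max(frame)
    exps = [math.exp(v - m) for v in frame]
    z = sum(exps)
    return [e / z for e in exps]


def sample_tokens(cond, num_steps=10, seed=0):
    """Sample one token per frame by progressively unmasking positions.

    Starts from an all-MASK sequence. With a linear scheduler, a token that
    is still masked at step s is revealed with probability 1 / (num_steps - s),
    so every position is revealed by the final step.
    """
    rng = random.Random(seed)
    tokens = [MASK] * len(cond)
    for step in range(num_steps):
        p_unmask = 1.0 / (num_steps - step)
        for i, tok in enumerate(tokens):
            if tok == MASK and rng.random() < p_unmask:
                probs = toy_posterior(cond[i])
                tokens[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return tokens


if __name__ == "__main__":
    sync = [[0.1, 0.2, 0.3, 0.4] for _ in range(6)]     # lip-sync features (toy)
    prosody = [[0.4, 0.3, 0.2, 0.1] for _ in range(6)]  # face-derived prosody (toy)
    print(sample_tokens(fuse(sync, prosody), num_steps=8, seed=0))
```

In a real system the posterior would come from a trained network conditioned on the full fused sequence, and prosody and acoustic tokens would likely be modeled as separate streams; the sketch only illustrates the iterative mask-to-token sampling pattern.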