Automatic Video Dubbing (AVD) aims to generate speech from a given script that is synchronized with lip motion and expressive in prosody. Current AVD models mainly exploit the visual information of the current sentence to enhance the prosody of the synthesized speech. However, because the dubbing is combined with the original surrounding audio in the final video, it is crucial that the prosody of the generated dubbing remains consistent with the multimodal context; this aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, which converts the modeling object from a single sentence to a longer sequence carrying context information, thereby ensuring globally consistent context prosody. MCDubber comprises three main components: (1) a context duration aligner, which learns the context-aware alignment between text and lip frames; (2) a context prosody predictor, which reads the global context visual sequence and predicts context-aware global energy and pitch; and (3) a context acoustic decoder, which predicts the global context mel-spectrogram with the assistance of the ground-truth mel-spectrograms adjacent to the target sentence. Through this process, MCDubber fully accounts for the influence of the multimodal context on the prosodic expressiveness of the current sentence. The mel-spectrogram of the target sentence, extracted from the predicted context mel-spectrogram, yields the final dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that MCDubber significantly improves dubbing expressiveness over all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
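The three-stage pipeline described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: all function names, the even duration split, the averaging-based "prosody prediction", and the constant mel frames are illustrative assumptions; the real components are learned neural modules.

```python
# Hypothetical sketch of the MCDubber pipeline: align durations from lip
# frames, predict context-aware prosody, decode a context mel-spectrogram,
# then slice out the target sentence's frames.

def context_duration_aligner(phonemes, lip_frames):
    """Toy stand-in for the context duration aligner: spread the
    context lip-frame budget evenly across the phonemes."""
    base, rem = divmod(len(lip_frames), len(phonemes))
    return [base + (1 if i < rem else 0) for i in range(len(phonemes))]

def context_prosody_predictor(visual_feats):
    """Toy stand-in for the context prosody predictor: derive global
    energy and pitch contours from context visual feature vectors."""
    energy = [sum(f) / len(f) for f in visual_feats]
    pitch = [max(f) - min(f) for f in visual_feats]
    return energy, pitch

def _stretch(seq, length):
    """Nearest-neighbour resample a contour to `length` frames."""
    return [seq[min(i * len(seq) // length, len(seq) - 1)]
            for i in range(length)]

def context_acoustic_decoder(durations, energy, pitch, n_mels=4):
    """Toy stand-in for the context acoustic decoder: one constant
    mel frame per expanded frame slot, conditioned on prosody."""
    total = sum(durations)
    return [[e + p] * n_mels
            for e, p in zip(_stretch(energy, total), _stretch(pitch, total))]

def dub(phonemes, lip_frames, visual_feats, target_span):
    """Run the three stages on the full context, then extract the
    mel frames belonging to the target sentence (the final dubbing)."""
    durations = context_duration_aligner(phonemes, lip_frames)
    energy, pitch = context_prosody_predictor(visual_feats)
    context_mel = context_acoustic_decoder(durations, energy, pitch)
    start, end = target_span  # frame indices of the target sentence
    return context_mel[start:end]
```

The key design point this sketch mirrors is that every stage operates on the *whole context sequence*; only at the very end is the target sentence's portion of the mel-spectrogram sliced out.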