Given a piece of text, a video clip, and reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to simultaneously maintain audio-visual synchronization and achieve clear pronunciation; (2) they lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify the emotion type and emotional intensity while maintaining high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which learns the inherent consistency between lip motion and prosody variation via duration-level contrastive learning to achieve reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, a speaker identity adapting module decodes the acoustic prior and injects the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) synthesizes the waveform with a flow-matching prediction network conditioned on the acoustic prior. In this process, FUEC determines the gradient direction and guidance scale from the user's emotion instructions through a positive-and-negative guidance mechanism, which amplifies the desired emotion while suppressing others. Extensive experiments on three benchmark datasets demonstrate favorable performance compared with several state-of-the-art methods.
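The positive-and-negative guidance idea in FUEC can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: the function names, the linear combination form, and the scales `alpha`/`beta` are all illustrative stand-ins for how a flow-matching velocity prediction might be steered toward a target emotion and away from competing ones.

```python
def guided_velocity(v_uncond, v_pos, v_neg, alpha=2.0, beta=1.0):
    """Hypothetical sketch of positive/negative emotion guidance.

    v_uncond: velocity predicted without any emotion condition
    v_pos:    velocity predicted under the user's target emotion
    v_neg:    velocity predicted under competing (suppressed) emotions
    alpha:    positive guidance scale (amplifies the desired emotion)
    beta:     negative guidance scale (pushes away from other emotions)

    Starting from the unconditional velocity, we move toward the
    positive-conditioned direction and away from the negative one,
    in the spirit of classifier-free guidance.
    """
    return v_uncond + alpha * (v_pos - v_uncond) - beta * (v_neg - v_uncond)


# Toy scalar example: target emotion pulls the velocity up, competing
# emotions pull it down; guidance amplifies the target direction.
v = guided_velocity(v_uncond=0.0, v_pos=1.0, v_neg=-1.0, alpha=2.0, beta=1.0)
print(v)  # 0 + 2*(1-0) - 1*(-1-0) = 3.0
```

In practice such a combined velocity would be integrated over the flow-matching ODE steps, with the user's intensity instruction mapped to the guidance scales; the scalar arithmetic above only shows the direction of the combination.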