Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and they exhibit suboptimal lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using this pipeline, we construct the first richly annotated Chinese television dubbing dataset and demonstrate its high quality. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.