We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, which addresses the limitations of existing speech generation models in creating immersive auditory experiences aligned with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct Vivid-210K, a large-scale, high-quality hybrid multimodal dataset whose programmatic construction pipeline establishes, for the first time, a strong correspondence between visual scenes, speaker identity, and audio. Second, we design D-MSVA, a core alignment module that leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results show that VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.