Visual Text-to-Speech (VTTS) aims to take a spatial environmental image as the prompt and synthesize reverberant speech for the spoken content. Previous research focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge such as depth, speaker position, and environmental semantics. To address these issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS. Specifically, we first treat the RGB image as the dominant source and take the depth image, speaker position knowledge from object detection, and semantic captions from an image-understanding LLM as supplementary sources. We then propose a serial interaction mechanism to deeply engage both the dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated according to the contribution of each source. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive spatial speech experience. Experimental results demonstrate that MS$^2$KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/MS2KU-VTTS/MS2KU-VTTS.
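To make the dominant/supplementary design more concrete, below is a minimal, hypothetical PyTorch sketch of one way the serial interaction and contribution-based dynamic integration could be organized. It is not the released implementation: the module names, feature dimensions, cross-attention layout, and gating scheme are all assumptions for illustration only.

```python
# Hypothetical sketch (not the authors' code): RGB tokens act as the dominant
# source and are serially refined by cross-attending to each supplementary
# source (depth, speaker position, semantic caption); the per-source states are
# then fused with learned contribution weights. All dimensions are assumed.
import torch
import torch.nn as nn


class SerialMultiSourceFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_supplementary: int = 3):
        super().__init__()
        # One cross-attention block per supplementary source, applied serially:
        # the dominant (RGB) representation is the query; each supplementary
        # source provides keys/values in turn.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_supplementary)]
        )
        # Gate producing one contribution weight per source (dominant + supplements).
        self.gate = nn.Linear(dim, num_supplementary + 1)

    def forward(self, rgb: torch.Tensor, supplements: list) -> torch.Tensor:
        # rgb:         (B, N_rgb, dim) dominant-source tokens
        # supplements: list of (B, N_i, dim) tensors, e.g. depth, position, caption
        query = rgb
        per_source = [rgb]
        for attn, sup in zip(self.cross_attn, supplements):
            query, _ = attn(query, sup, sup)  # serial engagement with each source
            per_source.append(query)
        # Dynamic integration: contribution weights from the pooled fused state.
        weights = torch.softmax(self.gate(query.mean(dim=1)), dim=-1)       # (B, S+1)
        stacked = torch.stack([s.mean(dim=1) for s in per_source], dim=1)   # (B, S+1, dim)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)                # (B, dim)
        return fused  # conditioning vector for the speech generation model


if __name__ == "__main__":
    B, dim = 2, 256
    rgb = torch.randn(B, 49, dim)       # patch features from an RGB encoder (assumed)
    depth = torch.randn(B, 49, dim)     # depth-image features (assumed)
    position = torch.randn(B, 4, dim)   # speaker-position tokens from detection (assumed)
    caption = torch.randn(B, 16, dim)   # caption embeddings from an image-understanding LLM (assumed)
    module = SerialMultiSourceFusion(dim=dim)
    print(module(rgb, [depth, position, caption]).shape)  # torch.Size([2, 256])
```

In such a sketch, the fused vector would condition the TTS acoustic model so that the generated speech reflects the room geometry, the speaker's position, and the scene semantics; the actual conditioning interface in MS$^2$KU-VTTS may differ.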