Visual Text-to-Speech (VTTS) aims to synthesize reverberant speech for spoken content, taking an image of the environment as the prompt. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information, which are crucial for understanding the spatial environment, have been ignored by previous works. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme for immersive VTTS, termed M2SE-VTTS. The multi-modal branch takes both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, while the multi-scale branch models local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt Gemini-generated environment captions to guide local spatial understanding. After that, the multi-modal and multi-scale features are integrated by a local-aware global spatial understanding module. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations show that our model outperforms advanced baselines in environmental speech generation. Code and audio samples are available at: https://github.com/AI-S2-Lab/M2SE-VTTS.
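To make the pipeline concrete, the following is a minimal sketch of the multi-modal, multi-scale flow described above: patchifying the RGB and Depth images, letting patch tokens attend to an embedding of the environment caption for local understanding, and fusing both modalities into a global environment representation. All module names, dimensions, the caption embedding, and the fusion strategy are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the M2SE-VTTS spatial-understanding stages;
# names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (B, C, H, W) image into (B, N, C*patch*patch) patch tokens."""
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)  # B,C,H/p,W/p,p,p
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

class LocalSpatialUnderstanding(nn.Module):
    """Caption-guided local understanding: patch tokens (queries) attend to
    a stand-in embedding of the Gemini-generated environment caption."""
    def __init__(self, patch_dim: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patches: torch.Tensor, caption_emb: torch.Tensor):
        q = self.proj(patches)                       # (B, N, d_model)
        out, _ = self.attn(q, caption_emb, caption_emb)
        return out                                   # caption-aware local tokens

class LocalAwareGlobalUnderstanding(nn.Module):
    """Fuse RGB and Depth local tokens into one global environment vector."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, rgb_local: torch.Tensor, depth_local: torch.Tensor):
        fused = self.fuse(torch.cat([rgb_local, depth_local], dim=-1))
        return fused.mean(dim=1)                     # (B, d_model) global embedding

if __name__ == "__main__":
    B, patch, d = 2, 16, 256
    rgb = torch.randn(B, 3, 224, 224)
    depth = torch.randn(B, 1, 224, 224)
    caption = torch.randn(B, 32, d)                  # stand-in caption embedding
    rgb_tokens = patchify(rgb, patch)                # (B, 196, 768)
    depth_tokens = patchify(depth, patch)            # (B, 196, 256)
    local_rgb = LocalSpatialUnderstanding(rgb_tokens.shape[-1], d)(rgb_tokens, caption)
    local_depth = LocalSpatialUnderstanding(depth_tokens.shape[-1], d)(depth_tokens, caption)
    env = LocalAwareGlobalUnderstanding(d)(local_rgb, local_depth)
    print(env.shape)                                 # torch.Size([2, 256])
```

In a full system, the resulting environment embedding would condition the TTS acoustic model so the synthesized speech carries the room's reverberation characteristics; that conditioning step is omitted here.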