The style transfer task in Text-to-Speech (TTS) refers to transferring style information onto given text content so as to generate speech with a specific style. However, most existing style transfer approaches rely on either fixed emotional labels or reference speech clips, and therefore cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible, multi-modal, and style-controllable TTS framework named MM-TTS. It can take any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style-controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space so that an arbitrary modality can serve as the style prompt in a single system, and 2) efficiently transferring the unified style representation onto the given text content, thereby enabling the generation of speech whose style matches the prompt. To address these problems, we propose an aligned multi-modal prompt encoder that embeds the different modalities into a unified style space, supporting style transfer across modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to mitigate the over-smoothing of generated Mel-spectrograms and produce audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and on out-of-domain datasets demonstrate that MM-TTS achieves satisfactory results with multi-modal prompts.
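To make the idea of a style-conditioned convolution concrete, below is a minimal sketch of a 1-D convolution whose kernel is modulated per sample by a style embedding, assuming StyleGAN2-style weight modulation and demodulation as the conditioning mechanism. The layer name, interface, and modulation scheme are illustrative assumptions, not the paper's exact formulation of Style Adaptive Convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleAdaptiveConv1d(nn.Module):
    """Hypothetical sketch: a 1-D convolution whose kernel is scaled
    per sample by a projected style vector (weight modulation), then
    demodulated to keep activation variance stable."""

    def __init__(self, channels: int, style_dim: int, kernel_size: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel_size) * 0.02)
        # Maps the unified style vector to a per-input-channel scale.
        self.style_proj = nn.Linear(style_dim, channels)
        self.padding = kernel_size // 2

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); style: (batch, style_dim)
        b, c, t = x.shape
        # Scale centered around 1 so an uninformative style leaves the kernel unchanged.
        scale = self.style_proj(style) + 1.0                       # (b, c)
        w = self.weight.unsqueeze(0) * scale.view(b, 1, c, 1)      # modulate input channels
        # Demodulate so each output filter has roughly unit norm.
        demod = torch.rsqrt(w.pow(2).sum(dim=[2, 3], keepdim=True) + 1e-8)
        w = (w * demod).reshape(b * c, c, -1)
        # Grouped convolution applies a different kernel to each sample in the batch.
        x = x.reshape(1, b * c, t)
        out = F.conv1d(x, w, padding=self.padding, groups=b)
        return out.reshape(b, c, t)
```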
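Similarly, the sketch below illustrates how a rectified-flow refiner could sharpen an over-smoothed Mel-spectrogram: a learned velocity field is integrated with a few explicit Euler steps from Gaussian noise toward the data distribution, conditioned on the coarse acoustic-model output. The `velocity_net(x_t, t, cond)` interface and the step count are assumptions for illustration, not the paper's API.

```python
import torch

@torch.no_grad()
def rectified_flow_refine(velocity_net, coarse_mel: torch.Tensor, num_steps: int = 10) -> torch.Tensor:
    """Hypothetical sketch of a rectified-flow refiner. The network is assumed
    to predict the velocity dx/dt along (near-)straight paths from noise (t=0)
    to sharp Mel-spectrograms (t=1), conditioned on the coarse prediction."""
    x = torch.randn_like(coarse_mel)               # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((coarse_mel.size(0),), i * dt, device=coarse_mel.device)
        v = velocity_net(x, t, coarse_mel)         # predicted velocity at (x, t)
        x = x + v * dt                             # explicit Euler step along the flow
    return x                                       # refined Mel-spectrogram at t = 1
```

Because rectified flow learns nearly straight transport paths, a small number of Euler steps can already yield a usable sample, which is what makes it attractive as a lightweight refinement stage.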