The style transfer task in Text-to-Speech (TTS) refers to transferring style information onto given text content so as to generate speech with a specific style. However, most existing style transfer approaches rely on either fixed emotional labels or reference speech clips, and therefore cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible, multi-modal, and style-controllable TTS framework named MM-TTS. It can take any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style-controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space so that an arbitrary modality can serve as the style prompt in a single system, and 2) efficiently transferring the unified style representation onto the given text content, thereby enabling the generation of speech whose style matches the prompt. To address these problems, we propose an aligned multi-modal prompt encoder that embeds the different modalities into a unified style space, supporting style transfer across modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to mitigate the over-smoothing of generated Mel-spectrograms and produce audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and on out-of-domain datasets demonstrate that MM-TTS achieves satisfactory results with multi-modal prompts.
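To make the idea of a style-conditioned convolution concrete, below is a minimal sketch of a 1-D convolution whose kernel is modulated per sample by a style embedding, assuming StyleGAN2-style weight modulation and demodulation as the conditioning mechanism. The layer name, interface, and modulation scheme are illustrative assumptions, not the paper's exact formulation of Style Adaptive Convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleAdaptiveConv1d(nn.Module):
    """Hypothetical sketch: a 1-D convolution whose kernel is scaled
    per sample by a projected style vector (weight modulation), then
    demodulated to keep activation variance stable."""

    def __init__(self, channels: int, style_dim: int, kernel_size: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel_size) * 0.02)
        # Maps the unified style vector to a per-input-channel scale.
        self.style_proj = nn.Linear(style_dim, channels)
        self.padding = kernel_size // 2

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); style: (batch, style_dim)
        b, c, t = x.shape
        # Scale centered around 1 so an uninformative style leaves the kernel unchanged.
        scale = self.style_proj(style) + 1.0                       # (b, c)
        w = self.weight.unsqueeze(0) * scale.view(b, 1, c, 1)      # modulate input channels
        # Demodulate so each output filter has roughly unit norm.
        demod = torch.rsqrt(w.pow(2).sum(dim=[2, 3], keepdim=True) + 1e-8)
        w = (w * demod).reshape(b * c, c, -1)
        # Grouped convolution applies a different kernel to each sample in the batch.
        x = x.reshape(1, b * c, t)
        out = F.conv1d(x, w, padding=self.padding, groups=b)
        return out.reshape(b, c, t)
```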
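Similarly, the sketch below illustrates how a rectified-flow refiner could sharpen an over-smoothed Mel-spectrogram: a learned velocity field is integrated with a few explicit Euler steps from Gaussian noise toward the data distribution, conditioned on the coarse acoustic-model output. The `velocity_net(x_t, t, cond)` interface and the step count are assumptions for illustration, not the paper's API.

```python
import torch

@torch.no_grad()
def rectified_flow_refine(velocity_net, coarse_mel: torch.Tensor, num_steps: int = 10) -> torch.Tensor:
    """Hypothetical sketch of a rectified-flow refiner. The network is assumed
    to predict the velocity dx/dt along (near-)straight paths from noise (t=0)
    to sharp Mel-spectrograms (t=1), conditioned on the coarse prediction."""
    x = torch.randn_like(coarse_mel)               # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((coarse_mel.size(0),), i * dt, device=coarse_mel.device)
        v = velocity_net(x, t, coarse_mel)         # predicted velocity at (x, t)
        x = x + v * dt                             # explicit Euler step along the flow
    return x                                       # refined Mel-spectrogram at t = 1
```

Because rectified flow learns nearly straight transport paths, a small number of Euler steps can already yield a usable sample, which is what makes it attractive as a lightweight refinement stage.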