Expressive speech synthesis models are trained on datasets augmented with corpora covering diverse speakers, emotions, and speaking styles, so that the characteristics of speech can be controlled and the desired voice generated. In this paper, we propose a style control (SC) VALL-E model based on the neural codec language model VALL-E, which follows the structure of the generative pretrained transformer 3 (GPT-3). The proposed SC VALL-E takes a text sentence and prompt audio as input. Rather than simply mimicking the characteristics of the prompt audio, it generates controllable speech by adjusting speech attributes to produce diverse voices. We identify tokens in the style embedding matrix of a newly designed style network that represent attributes such as emotion, speaking rate, pitch, and voice intensity, and design a model that can control these attributes. To evaluate the performance of SC VALL-E, we conduct comparative experiments against three representative expressive speech synthesis models: global style token (GST) Tacotron2, variational autoencoder (VAE) Tacotron2, and the original VALL-E. We measure word error rate (WER), F0 voiced error (FVE), and F0 gross pitch error (F0GPE) to assess the accuracy of the generated sentences. To compare the quality of the synthesized speech, we measure comparative mean opinion score (CMOS) and similarity mean opinion score (SMOS). To evaluate the style control ability of the model, we modify the trained tokens and observe the resulting changes in F0 and in the mel-spectrogram. When given prompt audio that is not present in the training data, SC VALL-E generates a variety of expressive sounds and demonstrates competitive performance compared to the existing models. Our implementation, pretrained models, and audio samples are available on GitHub.
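As an illustration of two of the accuracy metrics named above, the following is a minimal sketch of how WER and a gross pitch error could be computed. It is not the paper's evaluation code. The function names, the whitespace tokenization, the convention that an F0 value of 0 marks an unvoiced frame, and the 20% tolerance are all assumptions chosen for illustration; the exact definitions of FVE and F0GPE used in the experiments are not given in this abstract and may differ.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def gross_pitch_error(
    f0_ref: Sequence[float],
    f0_syn: Sequence[float],
    tolerance: float = 0.2,  # assumed threshold: 20% relative deviation
) -> float:
    """Fraction of frames voiced in both tracks whose F0 deviates from the
    reference by more than `tolerance` (relative to the reference).

    Assumption: the two F0 tracks are already time-aligned frame by frame,
    and a value of 0 marks an unvoiced frame.
    """
    voiced_pairs = [(r, s) for r, s in zip(f0_ref, f0_syn) if r > 0 and s > 0]
    if not voiced_pairs:
        return 0.0
    gross = sum(1 for r, s in voiced_pairs if abs(s - r) > tolerance * r)
    return gross / len(voiced_pairs)
```

In practice, the hypothesis text would come from an automatic speech recognizer applied to the synthesized audio, and the F0 tracks from a pitch extractor applied to the reference and synthesized audio. Both are outside the scope of this sketch.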