PromptTTS 2: Describing and Generating Voices with Text Prompt

Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

Source: arXiv. Demo page: https://speechresearch.github.io/prompttts2

Speech conveys more information than just text, as the same word can be uttered in various voices to convey diverse information. Traditional text-to-speech (TTS) methods rely on speech prompts (reference speech) for voice variability; text prompts (descriptions) are more user-friendly, since speech prompts can be hard to find or may not exist at all. TTS approaches based on text prompts face two challenges: 1) the one-to-many problem, where not all details of voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and a large data-labeling cost. In this work, we introduce PromptTTS 2 to address these challenges with a variation network, which provides the voice variability information not captured by text prompts, and a prompt generation pipeline, which uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about the voice) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech using a speech understanding model, which recognizes voice attributes (e.g., gender, speed) from speech, and a large language model, which formulates text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous works, PromptTTS 2 generates voices more consistent with text prompts and supports sampling of diverse voice variability, thereby offering users more choices in voice generation. Additionally, the prompt generation pipeline produces high-quality prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online at https://speechresearch.github.io/prompttts2.
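
The prompt generation pipeline described above (recognized voice attributes in, LLM-written text prompt out) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the attribute dictionary stands in for the output of the speech understanding model, `build_llm_instruction` and `compose_text_prompt` are hypothetical helpers, the instruction wording is invented, and `stub_llm` is a placeholder for a real large language model call.

```python
from typing import Callable, Dict


def build_llm_instruction(attributes: Dict[str, str]) -> str:
    """Turn recognized voice attributes into an instruction for an LLM.

    `attributes` is assumed to be the output of a speech understanding
    model, e.g. {"gender": "female", "speed": "fast"}. The instruction
    wording is hypothetical.
    """
    # Sort by attribute name so the instruction is deterministic.
    described = ", ".join(
        f"{name}: {value}" for name, value in sorted(attributes.items())
    )
    return (
        "Write one sentence describing a voice with these attributes: "
        + described
        + "."
    )


def compose_text_prompt(
    attributes: Dict[str, str], llm: Callable[[str], str]
) -> str:
    """Ask the supplied LLM callable to formulate a text prompt."""
    return llm(build_llm_instruction(attributes))


def stub_llm(instruction: str) -> str:
    """Placeholder for a real LLM: it only echoes the instruction it got."""
    return f"[LLM output for] {instruction}"


if __name__ == "__main__":
    recognized = {"speed": "fast", "gender": "female"}  # assumed recognizer output
    print(compose_text_prompt(recognized, stub_llm))
```

Passing the LLM as a callable keeps the sketch independent of any particular model or API; in practice `stub_llm` would be replaced by a function that sends the instruction to an actual language model and returns its text.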
