Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes. Recent advances in expressive TTS let users control the synthesis style directly through natural language prompts. However, these methods often require extensive training on large amounts of style-annotated data, which can be difficult to acquire. Moreover, their adaptability may be limited because the style annotations are fixed. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model that requires minimal human annotation. Our approach uses a large language model (LLM) to recast expressive TTS as a style retrieval task. Given an external style prompt, which can be either the raw input text or a natural language style description, the LLM selects the best-matching style reference from a set of annotated utterances. The selected reference then guides the TTS pipeline to synthesize speech in the intended style. This approach provides flexible, versatile, and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus show that FS-TTS can leverage the LLM's semantic inference ability to retrieve the desired style from either the input text or a user-defined description, producing synthetic speech that closely matches the specified style.
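The retrieval step described above can be illustrated with a minimal sketch. This is not the FS-TTS implementation: the names `StyleReference`, `keyword_overlap`, and `retrieve_style` are hypothetical, the example annotations are invented for illustration, and the word-overlap scorer is only a toy stand-in for the LLM's semantic matching. The sketch shows just the shape of the task: score each annotated utterance against the style prompt and pass the best match on as the style reference.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StyleReference:
    """Hypothetical record for one annotated utterance."""
    utterance_id: str
    annotation: str  # short human-written style description


def keyword_overlap(prompt: str, annotation: str) -> float:
    """Toy scorer (assumption): Jaccard overlap of lowercase words.

    In the approach described above an LLM performs this matching;
    this function only stands in for it so the sketch runs offline.
    """
    a = set(prompt.lower().split())
    b = set(annotation.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_style(
    prompt: str,
    references: Sequence[StyleReference],
    score: Callable[[str, str], float] = keyword_overlap,
) -> StyleReference:
    """Return the annotated utterance whose annotation best matches the prompt."""
    if not references:
        raise ValueError("at least one style reference is required")
    return max(references, key=lambda ref: score(prompt, ref.annotation))


# Invented example annotations, for illustration only.
REFERENCES = [
    StyleReference("utt_001", "calm gentle narration"),
    StyleReference("utt_002", "angry shouting villain"),
    StyleReference("utt_003", "excited cheerful child"),
]

best = retrieve_style("a gentle and calm voice", REFERENCES)
print(best.utterance_id)
```

Because the scorer is passed in as a parameter, the toy function could be swapped for a call to an LLM without changing the retrieval logic, and the prompt can be either the raw input text or a style description.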