Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech that preserves the content's semantics while mimicking the prompt's style. However, there is no systematic understanding of how the prompt and the content control the synthesized audio. In this work, we conduct an empirical study of widely used autoregressive (AR) and non-autoregressive (NAR) speech LMs and provide insights into prompt design and content semantic units. Our analysis reveals that heterogeneous and nonstationary prompts hurt audio quality, in contrast to the previous finding that longer prompts always lead to better synthesis. Moreover, we find that the speaker style of the synthesized audio is affected by the content as well as by the prompt. We further show that semantic units carry rich acoustic information such as pitch, tempo, volume, and speech emphasis, which may leak from the content into the synthesized audio.