Recent advances in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation in which content, timbre, speaker identity, and emotion are controlled through input prompts. As a result, these models rely heavily on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, both of which are crucial for expressive speech synthesis. To address this challenge, we propose a two-stage prompt selection strategy designed specifically for expressive speech synthesis. In the static stage (before synthesis), we first evaluate prompt candidates using pitch-based prosodic features, perceptual audio quality, and text-emotion coherence scores assigned by an LLM. We then assess the candidates under a specific TTS model by measuring character error rate, speaker similarity, and emotional similarity between the synthesized speech and the prompt speech. In the dynamic stage (during synthesis), we use a textual similarity model to select the prompt that is best aligned with the current input text. Experimental results demonstrate that our strategy effectively selects prompts that yield speech with both high-intensity emotional expression and robust speaker identity, leading to more expressive and stable zero-shot TTS performance. Audio samples and code will be available at https://whyrrrrun.github.io/ExpPro.github.io/.
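The two-stage selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that all static-stage scores have already been computed elsewhere and normalized to [0, 1]; the `PromptCandidate` fields, the threshold values, and the function names are hypothetical; and a simple bag-of-words cosine stands in for the textual similarity model, which the abstract does not specify.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class PromptCandidate:
    """One prompt candidate with precomputed scores (hypothetical schema)."""
    text: str           # transcript of the prompt speech
    pitch_score: float  # pitch-based prosodic score
    quality: float      # perceptual audio quality score
    coherence: float    # LLM-rated text-emotion coherence
    cer: float          # character error rate of speech synthesized from this prompt
    speaker_sim: float  # speaker similarity, synthesized vs. prompt speech
    emotion_sim: float  # emotional similarity, synthesized vs. prompt speech


def static_filter(candidates, min_score=0.5, max_cer=0.1):
    """Static stage: keep candidates that pass every check.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return [
        c for c in candidates
        if c.pitch_score >= min_score
        and c.quality >= min_score
        and c.coherence >= min_score
        and c.cer <= max_cer
        and c.speaker_sim >= min_score
        and c.emotion_sim >= min_score
    ]


def bow_cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for a learned text model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def dynamic_select(pool, input_text):
    """Dynamic stage: pick the prompt whose text is most similar to the input."""
    if not pool:
        raise ValueError("no prompt candidates survived the static stage")
    return max(pool, key=lambda c: bow_cosine(c.text, input_text))
```

In this sketch the static stage runs once per speaker and emotion to build a pool, while `dynamic_select` runs for every sentence to be synthesized. A hard threshold filter is only one possible design; a weighted ranking over the same scores would fit the description in the abstract equally well.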