Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control the content, timbre, and emotion of generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this issue, this paper proposes EmoPro, a two-stage prompt selection strategy specifically designed for emotionally controllable speech synthesis. The strategy selects highly expressive, high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected with the proposed method yield more emotionally expressive and engaging synthesized speech than those obtained with baseline selection schemes. Audio samples and code will be available at https://whyrrrrun.github.io/EmoPro/.
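The two-stage selection described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the scoring functions, weights, and the coarse-filter/fine-rank split are all assumptions; in practice each of the four scores would come from dedicated evaluation models (e.g. emotion classifiers or quality predictors).

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    """Candidate prompt with precomputed scores (all names are hypothetical)."""
    audio_id: str
    emotion_strength: float          # emotional expression strength
    speech_quality: float            # speech quality (e.g. a MOS-like score)
    text_emotion_consistency: float  # text-emotion consistency
    generation_score: float          # downstream model generation performance

def select_prompts(prompts, stage1_keep=0.5,
                   weights=(0.25, 0.25, 0.25, 0.25), top_k=1):
    # Stage 1 (assumed): coarse filter on expressiveness and quality,
    # keeping the top fraction of candidates.
    pool = sorted(prompts,
                  key=lambda p: p.emotion_strength + p.speech_quality,
                  reverse=True)
    pool = pool[:max(1, int(len(pool) * stage1_keep))]

    # Stage 2 (assumed): rank survivors by a weighted sum of all four criteria.
    w1, w2, w3, w4 = weights
    def score(p):
        return (w1 * p.emotion_strength
                + w2 * p.speech_quality
                + w3 * p.text_emotion_consistency
                + w4 * p.generation_score)
    return sorted(pool, key=score, reverse=True)[:top_k]
```

Under this sketch, a prompt that survives the coarse filter can still be out-ranked in stage 2 by one with better consistency and generation scores.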