Text-to-speech (TTS) models can generate natural, high-quality speech, but they are not expressive enough when synthesizing dramatically expressive speech such as stand-up comedy. Comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, so synthesizing their speech requires real-world datasets and strong speech style modeling capabilities, both of which pose challenges. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios. First, we extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way. Second, we enhance personal rhythm modeling with a conditional duration predictor. Third, we model personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data for each comedian. Audio samples are available at https://xh621.github.io/stand-up-comedy-demo/
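The idea of comedian-related special tokens for fillers can be illustrated with a minimal sketch. The abstract does not specify the filler inventory, the token format, or the text front end, so everything below is an assumption made for illustration: the `FILLERS` tuple, the `<comedian:filler>` token format, and the helper names `tag_fillers` and `build_vocab` are hypothetical and are not taken from ComedicSpeech itself.

```python
import re

# Hypothetical filler inventory; the actual list is not given in the abstract.
FILLERS = ("uh", "um", "er")


def tag_fillers(text, comedian_id):
    """Replace each filler word with a comedian-specific special token.

    The token format "<comedian_id:filler>" is an assumption for illustration.
    """
    tokens = []
    # Lowercase the text and keep only alphabetic words (and apostrophes).
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in FILLERS:
            tokens.append(f"<{comedian_id}:{word}>")
        else:
            tokens.append(word)
    return tokens


def build_vocab(sequences):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


seq_a = tag_fillers("Um, so I was, uh, late", "c1")
seq_b = tag_fillers("Um, so I was late", "c2")
vocab = build_vocab([seq_a, seq_b])
```

Because the same filler maps to a different vocabulary entry for each comedian, a downstream TTS model could in principle learn a separate embedding, and hence a separate acoustic realization, of "um" for each speaker, while ordinary words remain shared.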