Spontaneous speech differs notably from other speaking styles because of spontaneous phenomena (e.g., filled pauses, prolongations) and substantial prosodic variation (e.g., diverse pitch and duration patterns, occasional non-verbal speech such as smiling), which makes spontaneous style difficult to model and predict. Moreover, the scarcity of high-quality spontaneous data limits spontaneous speech generation for speakers who have no spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on bottleneck (BN) features that models and transfers spontaneous style for text-to-speech (TTS). In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from BN features, and we incorporate spontaneous phenomena by constraining the model with a spontaneous-phenomena embedding prediction loss. In addition, we introduce a flow-based predictor that predicts a latent spontaneous style representation from text, which enriches the prosody and context-specific spontaneous phenomena at inference time. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to target speakers. Experiments demonstrate that SponTTS effectively models spontaneous style and transfers it to target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. A zero-shot spontaneous-style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.
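As a rough illustration of the CVAE idea in the first stage, the sketch below shows the two generic building blocks that most CVAEs rely on: reparameterized sampling of a latent vector, and the KL divergence between a posterior and a conditional prior. The abstract does not state the form of the distributions, so diagonal Gaussians are an assumption here, and the function names `reparameterize` and `kl_diag_gaussians` are hypothetical. This is not the SponTTS implementation; in particular, the flow-based predictor and the BN feature extractor are not modeled.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    Assumption: the latent style is a diagonal Gaussian described by
    a mean vector and a log-variance vector.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    q plays the role of a posterior (e.g., inferred from speech features)
    and p the role of a conditional prior (e.g., predicted from text).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 3-dimensional posterior and prior; the values are illustrative only.
    mu_q, logvar_q = [0.2, -0.1, 0.5], [-1.0, -0.5, 0.0]
    mu_p, logvar_p = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    z = reparameterize(mu_q, logvar_q, rng)
    print("sampled latent:", z)
    print("KL(q || p):", kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p))
```

In a training loop, a KL term of this kind would typically be added to a reconstruction loss; how SponTTS weights it against the spontaneous-phenomena embedding prediction loss is not stated in the abstract.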