In expressive Text-to-Speech (TTS), explicit prosodic boundaries significantly improve the naturalness and controllability of synthesized speech. Although human prosody annotation contributes substantially to performance, it is labor-intensive and time-consuming, and it often yields inconsistent results. Even with extensive supervised data, the current benchmark model still falls short in performance. To address this issue, this paper proposes a two-stage automatic annotation pipeline. In the first stage, we propose contrastive text-speech pretraining on Speech-Silence and Word-Punctuation (SSWP) pairs. The pretraining aims to enhance the prosodic space extracted from the joint text-speech space. In the second stage, we build a multi-modal prosody annotator consisting of pretrained encoders, a straightforward yet effective text-speech feature fusion scheme, and a sequence classifier. Extensive experiments demonstrate that the proposed method excels at automatically generating prosody annotations and achieves state-of-the-art (SOTA) performance. Furthermore, the model remains robust when tested with varying amounts of data.
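The first-stage contrastive pretraining can be illustrated with a minimal sketch. The abstract does not state the exact objective, so the code below assumes a CLIP-style symmetric InfoNCE loss over a batch of paired embeddings, where row i of the speech side (a Speech-Silence segment) matches row i of the text side (a Word-Punctuation segment). The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Assumed CLIP-style loss: pair i on the speech side matches pair i on the text side.

    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    speech = [l2_normalize(v) for v in speech_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(speech)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in text]
        for s in speech
    ]

    # Speech-to-text direction: each row should pick out its own column.
    speech_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-speech direction: each column should pick out its own row.
    columns = [list(col) for col in zip(*logits)]
    text_to_speech = -sum(log_softmax(columns[i])[i] for i in range(n)) / n

    return 0.5 * (speech_to_text + text_to_speech)


# Toy embeddings standing in for encoder outputs (hypothetical values).
speech_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_batch = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
loss = symmetric_contrastive_loss(speech_batch, text_batch)
```

Under this assumed objective, correctly paired batches give a lower loss than mismatched ones, which is the property that pulls matching speech and text segments together in the joint space.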