Conversational speech synthesis (CSS) uses the dialogue history as supplementary information to generate speech whose prosody fits the conversation. Although previous methods have explored ways to improve context comprehension, the resulting context representations remain weak in two respects: representational power and context-sensitive discriminability. In this paper, we introduce CONCSS, a contrastive learning-based CSS framework. Within this framework, we define a pretext task specific to CSS that lets the model perform self-supervised learning on unlabeled conversational datasets, improving its understanding of context. We also introduce a sampling strategy for negative sample augmentation that makes the context vectors more discriminable. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments comparing CONCSS with prior CSS systems. The results show that speech synthesized by the proposed method has prosody that is more appropriate to the context and more sensitive to changes in it.
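To make the idea of a contrastive objective over context vectors concrete, the sketch below computes a generic InfoNCE-style loss for a single anchor. It is an illustration under stated assumptions, not the CONCSS objective: the abstract does not specify the loss, the similarity function, or how positives and negatives are built, so the function names (`cosine`, `info_nce`), the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (assumed form, not taken from the paper).

    The loss is the negative log-softmax of the positive's similarity
    among the similarities of the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy, made-up "context vectors" used purely for illustration.
anchor = [1.0, 0.0, 0.2]
positive = [0.9, 0.1, 0.3]
negatives = [[-1.0, 0.2, 0.0], [0.0, 1.0, -0.5]]

# Loss when the true positive is close to the anchor.
loss_good = info_nce(anchor, positive, negatives)
# Loss when a dissimilar vector is wrongly treated as the positive.
loss_bad = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

The loss is small when the anchor is closer to its positive than to any negative, and large otherwise. In a framework of the kind described above, the pretext task and the negative sampling strategy would determine which context vectors play the roles of `positive` and `negatives`; those choices are not reproduced here.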