Text-based speech editing (TSE) allows users to edit speech by directly modifying the corresponding text without altering the original recording. Current TSE techniques typically focus on minimizing discrepancies between the generated speech and the reference within the edited region during training to achieve fluent TSE performance. However, the generated speech in the edited region should also maintain acoustic and prosodic consistency with the unedited region and the original speech, at both the local and global levels. To maintain speech fluency, we propose a new fluent speech editing scheme built on our previous \textit{FluentEditor} model, termed \textit{\textbf{FluentEditor2}}, which incorporates multi-scale acoustic and prosody consistency training criteria into TSE training. Specifically, for local acoustic consistency, we propose a \textit{hierarchical local acoustic smoothness constraint} that aligns the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose a \textit{contrastive global prosody consistency constraint} that keeps the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that \textit{FluentEditor2} surpasses existing neural-network-based TSE methods, including EditSpeech, CampNet, A$^3$T, FluentSpeech, and our FluentEditor, in both subjective and objective evaluations. Ablation studies further highlight the contribution of each module to the overall effectiveness of the system. Speech demos are available at: \url{https://github.com/Ai-S2-Lab/FluentEditor2}.
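The two training criteria can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration only (function names, the multi-scale window widths standing in for frame/phoneme/word scales, and the cosine-based InfoNCE-style contrastive form), not the model's actual implementation:

```python
import numpy as np

def boundary_smoothness_loss(gen, ref, boundary, widths=(1, 5, 20)):
    """Hypothetical hierarchical local smoothness term: mean squared
    difference between generated and reference acoustic features in
    windows of increasing width around the edit boundary, roughly
    standing in for frame-, phoneme-, and word-level alignment."""
    losses = []
    for w in widths:
        lo, hi = max(0, boundary - w), min(len(ref), boundary + w)
        losses.append(np.mean((gen[lo:hi] - ref[lo:hi]) ** 2))
    return float(np.mean(losses))

def contrastive_prosody_loss(edited, original, negatives, tau=0.1):
    """Hypothetical contrastive global prosody term: pull the prosody
    embedding of the edited region toward the original utterance's
    embedding, push it away from embeddings of other utterances."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(edited, original) / tau)
    neg = sum(np.exp(cos(edited, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```

In this sketch a perfect match across the boundary drives the smoothness term to zero, and a prosody embedding closer to the original utterance than to the negatives yields a lower contrastive loss, which is the qualitative behavior the two constraints are designed to encourage.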