Text-based speech editing (TSE) allows users to edit speech by directly modifying the corresponding text without altering the original recording. Current TSE techniques typically focus on minimizing discrepancies between the generated speech and the reference within the edited region during training to achieve fluent TSE performance. However, the generated speech in the edited region should also maintain acoustic and prosodic consistency with the unedited region and the original speech, at both the local and global levels. To maintain speech fluency, we propose a new fluent speech editing scheme built on our previous \textit{FluentEditor} model, termed \textit{\textbf{FluentEditor2}}, which incorporates multi-scale acoustic and prosody consistency training criteria into TSE training. Specifically, for local acoustic consistency, we propose a \textit{hierarchical local acoustic smoothness constraint} that aligns the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose a \textit{contrastive global prosody consistency constraint} that keeps the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that \textit{FluentEditor2} surpasses existing neural-network-based TSE methods, including EditSpeech, CampNet, A$^3$T, FluentSpeech, and our FluentEditor, in both subjective and objective evaluations. Ablation studies further highlight the contribution of each module to the overall effectiveness of the system. Speech demos are available at: \url{https://github.com/Ai-S2-Lab/FluentEditor2}.
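The two training criteria can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration only (function names, the multi-scale window widths standing in for frame/phoneme/word scales, and the cosine-based InfoNCE-style contrastive form), not the model's actual implementation:

```python
import numpy as np

def boundary_smoothness_loss(gen, ref, boundary, widths=(1, 5, 20)):
    """Hypothetical hierarchical local smoothness term: mean squared
    difference between generated and reference acoustic features in
    windows of increasing width around the edit boundary, roughly
    standing in for frame-, phoneme-, and word-level alignment."""
    losses = []
    for w in widths:
        lo, hi = max(0, boundary - w), min(len(ref), boundary + w)
        losses.append(np.mean((gen[lo:hi] - ref[lo:hi]) ** 2))
    return float(np.mean(losses))

def contrastive_prosody_loss(edited, original, negatives, tau=0.1):
    """Hypothetical contrastive global prosody term: pull the prosody
    embedding of the edited region toward the original utterance's
    embedding, push it away from embeddings of other utterances."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(edited, original) / tau)
    neg = sum(np.exp(cos(edited, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```

In this sketch a perfect match across the boundary drives the smoothness term to zero, and a prosody embedding closer to the original utterance than to the negatives yields a lower contrastive loss, which is the qualitative behavior the two constraints are designed to encourage.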