Speech enhancement aims to improve the quality and intelligibility of speech signals, while speech editing modifies speech according to specific user needs. In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model based on conditional diffusion models that handles multiple tasks simultaneously in a generative manner. Specifically, by providing multiple types of conditions, including self-supervised learning embeddings and appropriate text prompts, to the score-based diffusion model, we enable controllable generation so that the unified model performs the corresponding action on the source speech. Our experiments show that the proposed uSee model achieves superior performance in both speech denoising and dereverberation compared with other related generative speech enhancement models, and can perform speech editing given a desired environmental sound text description, signal-to-noise ratio (SNR), and room impulse response (RIR). Demos of the generated speech are available at https://muqiaoy.github.io/usee.
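To make the SNR and RIR editing conditions concrete, the sketch below shows the classical signal model those two quantities usually refer to: the speech is convolved with a room impulse response, and noise is scaled so that the mixture reaches a target SNR in dB. This is not the uSee model and is not taken from the paper; it only illustrates what kind of output such conditions describe. The function name `apply_rir_and_noise` and all of its parameters are hypothetical.

```python
import numpy as np


def apply_rir_and_noise(speech, rir, noise, snr_db):
    """Reverberate `speech` with `rir`, then add `noise` at `snr_db` dB.

    Illustrative only: a conventional signal-processing mixture, not the
    generative editing procedure of the paper.
    Returns (mixture, reverberant_speech, scaled_noise).
    """
    speech = np.asarray(speech, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Reverberation: linear convolution, truncated to the input length.
    reverberant = np.convolve(speech, rir)[: len(speech)]

    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise

    return reverberant + scaled_noise, reverberant, scaled_noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)           # stand-in for 1 s of speech at 16 kHz
    rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # synthetic decaying RIR
    noise = rng.standard_normal(4000)             # stand-in for an environmental sound
    mixture, reverberant, scaled_noise = apply_rir_and_noise(speech, rir, noise, snr_db=5.0)
    measured = 10.0 * np.log10(np.mean(reverberant ** 2) / np.mean(scaled_noise ** 2))
    print(f"measured SNR: {measured:.2f} dB")
```

In this sketch the SNR is measured against the reverberant speech rather than the dry speech; that is an assumption made here for simplicity, and other conventions are possible.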