Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortions that degrade perceptual speech quality. Generative approaches, particularly diffusion-based methods, can improve perceptual quality but suffer from slow inference. We propose an efficient generative approach for TSE named the Diffusion Conditional Expectation Model (DCEM). It handles both multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM), which regenerates and optimizes speech quality starting from speech pre-processed by a discriminative model. Our method outperforms conventional methods on both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks. Audio examples are available online (https://vivian556123.github.io/dcem).