Diffusion model-based speech enhancement has received increasing attention because it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech enhancement, such as speech denoising, dereverberation, and source separation. In this paper, we investigate their use for target speech extraction (TSE), which consists of estimating the clean speech signal of a target speaker from a multi-talker mixture. TSE is realized by conditioning the extraction process on a clue identifying the target speaker. We show that TSE can be realized with a diffusion model conditioned on this clue. In addition, we introduce ensemble inference to reduce potential extraction errors caused by the diffusion process. In experiments on the Libri2mix corpus, we show that the proposed diffusion model-based TSE combined with ensemble inference outperforms a comparable TSE system trained discriminatively.
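The idea behind ensemble inference can be illustrated with a minimal sketch. The abstract does not state how the outputs of several inference runs are combined, so the sample-wise averaging below is an assumption made for illustration. The names `ensemble_inference`, `make_toy_sampler`, and `mse` are hypothetical, and the toy sampler is only a stand-in for a real conditional reverse-diffusion run: it returns the target signal plus a random error and ignores the mixture and the clue.

```python
import math
import random
from statistics import fmean


def ensemble_inference(sampler, mixture, clue, num_runs=8, seed=0):
    """Run a stochastic sampler several times and average the estimates.

    Assumption: the runs are combined by sample-wise averaging.
    """
    rng = random.Random(seed)
    runs = [sampler(mixture, clue, rng) for _ in range(num_runs)]
    # zip(*runs) groups the k-th sample of every run together.
    return [fmean(values) for values in zip(*runs)]


def make_toy_sampler(target, error_std=0.3):
    """Hypothetical stand-in for one conditional diffusion inference run.

    It ignores `mixture` and `clue` and returns the target plus Gaussian
    error, which mimics run-to-run variability of a generative model.
    """
    def sampler(mixture, clue, rng):
        return [x + rng.gauss(0.0, error_std) for x in target]
    return sampler


def mse(estimate, reference):
    """Mean squared error between two equal-length signals."""
    return fmean((e - r) ** 2 for e, r in zip(estimate, reference))


# Toy signals: a 220 Hz target and a 330 Hz interferer, 2000 samples at 8 kHz.
num_samples = 2000
times = [n / 8000.0 for n in range(num_samples)]
target = [math.sin(2 * math.pi * 220 * t) for t in times]
interferer = [math.sin(2 * math.pi * 330 * t) for t in times]
mixture = [a + b for a, b in zip(target, interferer)]
clue = None  # placeholder; a real system would pass a target-speaker clue

sampler = make_toy_sampler(target)
single = sampler(mixture, clue, random.Random(1))
ensembled = ensemble_inference(sampler, mixture, clue, num_runs=8)

mse_single = mse(single, target)
mse_ensemble = mse(ensembled, target)
print(f"single run MSE: {mse_single:.4f}")
print(f"ensemble MSE:   {mse_ensemble:.4f}")
```

In this toy setting the errors of the individual runs are independent and zero-mean, so averaging reduces the error variance roughly in proportion to the number of runs. Errors of an actual diffusion model need not behave this way, so the sketch shows the mechanism only and says nothing about the gains reported in the paper.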