Diffusion-based Frameworks for Unsupervised Speech Enhancement

This paper addresses $\textit{unsupervised}$ diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new unsupervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Our code will be publicly available at this $\href{https://github.com/jeaneudesAyilo/enudiffuse}{URL}$.
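
To make the NMF-structured Gaussian noise model more concrete, the following is a minimal sketch of how a noise variance of the form $WH$ could be refit in an M-step, given a noise power spectrogram estimate (for example, one computed from noise samples drawn in the E-step). The abstract does not specify the update rule, so this is an assumption: the sketch uses the widely used multiplicative updates for NMF under the Itakura-Saito divergence, and the names `is_nmf_update`, `is_divergence`, `P`, `W`, and `H` are hypothetical and not taken from the authors' code.

```python
import numpy as np

def is_divergence(P, V, eps=1e-10):
    """Itakura-Saito divergence between a power spectrogram P and a model V."""
    R = (P + eps) / (V + eps)
    return float(np.sum(R - np.log(R) - 1.0))

def is_nmf_update(P, W, H, n_iter=50, eps=1e-10):
    """Fit the noise variance model V = W @ H to the power spectrogram P.

    P: (F, T) non-negative noise power estimate (assumed input).
    W: (F, K) non-negative spectral templates.
    H: (K, T) non-negative temporal activations.
    Returns updated copies of W and H.
    """
    W, H = W.copy(), H.copy()
    for _ in range(n_iter):
        # Update the templates with the activations held fixed.
        V = W @ H + eps
        W *= ((P / V**2) @ H.T) / ((1.0 / V) @ H.T + eps)
        # Update the activations with the templates held fixed.
        V = W @ H + eps
        H *= (W.T @ (P / V**2)) / (W.T @ (1.0 / V) + eps)
    return W, H
```

Because the updates are multiplicative, `W` and `H` stay non-negative as long as they are initialized with non-negative values. In an EM loop of the kind described above, such a refit would alternate with the posterior-sampling E-step.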
