Conditioning on speaker embeddings can improve speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on the VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are publicly available.
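The pipeline described above (fit a GMM on clean-speech embeddings, match a noisy embedding to that prior, fuse the result into backbone features through a gate) can be illustrated with a minimal NumPy sketch. The abstract does not specify the matching rule or the gate, so everything here is an assumption made for illustration: diagonal covariances, matching as the posterior-weighted mean of the component means, and a sigmoid gate over concatenated features. The function names `fit_gmm`, `match_to_prior`, and `gated_fusion` are hypothetical and are not taken from the released code.

```python
import numpy as np


def log_gauss_diag(x, means, variances):
    """Log density of each row of x under each diagonal Gaussian, shape (N, K)."""
    diff = x[:, None, :] - means[None, :, :]  # (N, K, D)
    quad = np.sum(diff ** 2 / variances, axis=2)  # (N, K)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    return -0.5 * (quad + log_norm)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for each row of x."""
    log_p = np.log(weights) + log_gauss_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)  # subtract max for stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)


def fit_gmm(clean_emb, k, n_iter=50, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance GMM to clean embeddings with plain EM."""
    rng = np.random.default_rng(seed)
    n, _ = clean_emb.shape
    means = clean_emb[rng.choice(n, size=k, replace=False)].copy()
    variances = np.tile(clean_emb.var(axis=0) + var_floor, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        r = responsibilities(clean_emb, weights, means, variances)  # E-step
        nk = r.sum(axis=0) + 1e-8  # M-step: soft counts per component
        weights = nk / nk.sum()
        means = (r.T @ clean_emb) / nk[:, None]
        second_moment = (r.T @ clean_emb ** 2) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances


def match_to_prior(noisy_emb, weights, means, variances):
    """Assumed matching rule: posterior-weighted mean of the component means."""
    r = responsibilities(noisy_emb[None, :], weights, means, variances)[0]
    return r @ means  # (D,)


def gated_fusion(features, emb, w_proj, w_gate):
    """Assumed gate: sigmoid over [features; projected embedding], residual add.

    features: (T, C) backbone features, emb: (D,) matched embedding,
    w_proj: (D, C) projection, w_gate: (2C, C) gate weights.
    """
    proj = np.broadcast_to(emb @ w_proj, features.shape)  # (T, C)
    gate_in = np.concatenate([features, proj], axis=1)  # (T, 2C)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))  # (T, C)
    return features + gate * proj
```

In this sketch the prior is fitted once on clean embeddings, and at inference only the noisy embedding is needed, which mirrors the abstract's claim that no enrollment audio is required. In a trained system `w_proj` and `w_gate` would be learned jointly with the backbone; here they are plain arrays.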