This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling: we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on noisy mixtures of one, two, and three speakers and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in \ac{WER} across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the separated noise component is of sufficiently high fidelity to support downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/