This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling: we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Code and pretrained models will be made available upon acceptance. Demo page: https://ssnaps2026.github.io/ssnaps2026/