We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem and have achieved significant progress. However, they are limited in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving high-quality sound separation across diverse sound categories. We compare DAVIS against existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets; the results show that DAVIS outperforms them in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
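To make the sampling process described above concrete, the following is a minimal sketch of a standard conditional DDPM reverse-diffusion loop consistent with the abstract's description: starting from Gaussian noise and iteratively denoising, with each step conditioned on the mixture spectrogram and visual features. All names (`separation_unet`, `mixture_spec`, `visual_emb`), the linear noise schedule, and the spectrogram-domain formulation are assumptions for illustration, not details taken from the DAVIS paper.

```python
import torch

# Hypothetical interface: `separation_unet` predicts the noise added to the
# clean (separated) spectrogram, conditioned on the mixture spectrogram and a
# visual embedding of the sounding object. Schedule constants follow standard
# DDPM notation; none of these names come from the DAVIS paper itself.

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product: \bar{alpha}_t

@torch.no_grad()
def sample_separated_spectrogram(separation_unet, mixture_spec, visual_emb):
    """Reverse diffusion: start from Gaussian noise and iteratively denoise,
    conditioning every step on the audio mixture and the visual features."""
    x = torch.randn_like(mixture_spec)      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
        # The U-Net sees the noisy estimate together with both conditions.
        eps_hat = separation_unet(x, mixture_spec, visual_emb, t_batch)
        alpha_t, abar_t = alphas[t], alpha_bars[t]
        # Posterior mean of x_{t-1} given the predicted noise (DDPM update).
        mean = (x - (1 - alpha_t) / torch.sqrt(1 - abar_t) * eps_hat) \
               / torch.sqrt(alpha_t)
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x                                # estimated separated spectrogram
```

Under this sketch, the generative objective amounts to training the U-Net to predict the injected noise at each step; at inference time, the loop above maps pure noise to a separated spectrogram, which would then be inverted to a waveform (e.g., with the mixture phase).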