We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task in which audio is the conditional variable for segmenting the sound producer(s). Under this interpretation, it is essential to model the correlation between the audio and the final segmentation map so that the audio actually contributes to the prediction. We introduce a latent diffusion model into our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth-aware inference when we perform the denoising process at test time. Because our model is a conditional diffusion model, we argue that it is essential to ensure that the conditional variable contributes to the model output. We therefore introduce contrastive learning into our framework to learn audio-visual correspondence, which is proven to be consistent with maximizing the mutual information between the model prediction and the audio data. In this way, our latent diffusion model with contrastive learning explicitly maximizes the contribution of audio to AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are available on our project page: https://github.com/OpenNLPLab/DiffusionAVS.
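The link between contrastive learning and mutual information mentioned above can be illustrated with a generic InfoNCE-style loss. The following is a minimal sketch, not the authors' implementation: the function names `info_nce` and `mi_lower_bound`, the temperature value, and the assumption that audio and prediction features are already pooled into one vector per sample are all hypothetical. It relies on the standard result that log N minus the InfoNCE loss lower-bounds the mutual information between the paired variables, where N is the batch size.

```python
import numpy as np


def info_nce(audio_emb, pred_emb, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's exact objective).

    audio_emb, pred_emb: arrays of shape (N, D); row i of each is a matched pair.
    Other rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    p = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (N, N) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives lie on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


def mi_lower_bound(loss, batch_size):
    """Standard InfoNCE bound: I(audio; prediction) >= log(N) - loss."""
    return float(np.log(batch_size) - loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))
    # Hypothetical prediction features that correlate strongly with the audio.
    pred = audio + 0.01 * rng.normal(size=(8, 16))
    loss = info_nce(audio, pred)
    print("loss:", loss, "MI lower bound:", mi_lower_bound(loss, len(audio)))
```

Minimizing this loss pulls each audio embedding toward its own prediction embedding and away from the others in the batch, which tightens the bound; in that sense a lower contrastive loss corresponds to a prediction that depends more strongly on the audio.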