We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
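The idea of using audio semantics to guide pixel-level segmentation can be illustrated with a toy sketch. This is not the temporal pixel-wise audio-visual interaction module proposed here, whose details the abstract does not give. It only assumes that each pixel and each audio clip has a feature vector, and it marks a pixel as sounding when its scaled dot-product similarity to the audio feature is high. The function name `audio_guided_mask`, the `threshold` parameter, and the feature values are hypothetical.

```python
import math


def audio_guided_mask(pixel_feats, audio_feat, threshold=0.5):
    """Toy illustration: score each pixel by its similarity to an audio embedding.

    pixel_feats: H x W grid of feature vectors (lists of floats).
    audio_feat:  one feature vector for the audio at this frame's time step.
    Returns an H x W binary mask (1 = predicted sounding pixel).
    """
    scale = math.sqrt(len(audio_feat))  # scaled dot product, as in attention
    mask = []
    for row in pixel_feats:
        mask_row = []
        for feat in row:
            score = sum(p * a for p, a in zip(feat, audio_feat)) / scale
            prob = 1.0 / (1.0 + math.exp(-score))  # sigmoid -> pseudo-probability
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


# Hypothetical 2 x 2 frame with 2-dimensional features (made-up values).
pixel_feats = [
    [[2.0, 0.0], [-2.0, 0.0]],
    [[-1.0, -1.0], [3.0, 1.0]],
]
audio_feat = [1.0, 0.0]
mask = audio_guided_mask(pixel_feats, audio_feat)
print(mask)
```

In this sketch, pixels whose features align with the audio feature receive a positive score and are kept in the mask. A learned model would replace the hand-set features with encoder outputs and would be trained against the pixel-wise annotations.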