Audio-Visual Segmentation with Semantics

From arXiv. Submitted to TPAMI as a journal extension of ECCV 2022. Jinxing Zhou, Xuyang Shen, and Jianyuan Wang contributed equally to this work. Meng Wang and Yiran Zhong are the corresponding authors. Code is available at https://github.com/OpenNLPLab/AVSBench. Online benchmark is available at http://www.avlbench.opennlplab.cn. arXiv admin note: substantial text overlap with arXiv:2207.05042

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, AVSBench, which provides pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset and Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require binary masks of the sounding objects, marking the pixels that correspond to the audio, while the third setting further requires semantic maps that indicate the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench. Online benchmark is available at http://www.avlbench.opennlplab.cn.
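To make the difference between the two output formats concrete, the following is a minimal sketch in plain Python. It is not taken from the AVSBench code. The per-class score layout `scores[c][y][x]`, the convention that class 0 is background, and both helper functions are assumptions made for this illustration. It shows a semantic map (one category label per pixel) and the binary mask obtained by collapsing every non-background label to 1.

```python
from typing import List

Mask = List[List[int]]

BACKGROUND = 0  # assumed label for pixels that belong to no sounding object


def semantic_map_from_scores(scores: List[List[List[float]]]) -> Mask:
    """Pick the highest-scoring class for every pixel.

    `scores[c][y][x]` is a hypothetical per-class score for pixel (y, x);
    class 0 is assumed to be background.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


def binary_mask_from_semantic(semantic: Mask) -> Mask:
    """Collapse a semantic map into a sounding (1) / not-sounding (0) mask."""
    return [[int(label != BACKGROUND) for label in row] for row in semantic]


# Toy 2x3 frame with three classes: 0 = background, 1 and 2 = object categories.
scores = [
    [[0.9, 0.1, 0.2], [0.8, 0.7, 0.1]],    # background
    [[0.05, 0.8, 0.1], [0.1, 0.2, 0.1]],   # category 1
    [[0.05, 0.1, 0.7], [0.1, 0.1, 0.8]],   # category 2
]
semantic = semantic_map_from_scores(scores)
binary = binary_mask_from_semantic(semantic)
print(semantic)  # [[0, 1, 2], [0, 0, 2]]
print(binary)    # [[0, 1, 1], [0, 0, 1]]
```

In this toy frame the binary mask only records where a sounding object is, while the semantic map also records which category each of those pixels belongs to. Deriving one from the other is a simplification for illustration; the benchmark treats the binary and semantic settings as separate tasks.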
