Audio-visual segmentation, sound localization, semantic-aware sounding objects localization

The audio-visual segmentation (AVS) task aims to segment sounding objects in a given video. Existing works mainly focus on fusing the audio and visual features of a video to obtain sounding object masks. However, we observe that prior methods tend to segment a salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods may fail to localize the genuine sounding objects due to this dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video with an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object can be a sounding object in one video but a silent one in another. This introduces ambiguity when training our object segmentation network, as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of the audio is unknown, especially when there are multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate the audio with potential objects. Specifically, we apply the predicted audio category scores as attention over the potential instance masks, so that the scores highlight the corresponding sounding instances while suppressing inaudible ones. By enforcing the attended instance masks to resemble the ground-truth mask, we establish the audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased toward salient objects.
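The attention step described above can be illustrated with a small NumPy sketch. The abstract does not specify the exact formulation, so everything here is an assumption made for illustration: the function names (`attend_audio_to_instances`, `bce`), the use of a dot product between per-instance class probabilities and audio category scores as the attention weight, the pixel-wise maximum used to merge instances, and the binary cross-entropy loss used to compare against the ground-truth mask.

```python
import numpy as np


def attend_audio_to_instances(inst_masks, inst_class_probs, audio_scores):
    """Weight candidate instance masks by audio category scores.

    Hypothetical formulation (not taken from the paper):
      inst_masks       : (N, H, W) soft masks in [0, 1], one per candidate
      inst_class_probs : (N, C) class probabilities of each candidate
      audio_scores     : (C,) predicted audio category scores in [0, 1]
    Returns the merged (H, W) mask and the (N,) per-instance weights.
    """
    inst_masks = np.asarray(inst_masks, dtype=float)
    inst_class_probs = np.asarray(inst_class_probs, dtype=float)
    audio_scores = np.asarray(audio_scores, dtype=float)

    # How strongly the audio supports each candidate: sum_c p[n, c] * a[c].
    weights = np.clip(inst_class_probs @ audio_scores, 0.0, 1.0)

    # Scale every mask by its weight, then merge with a pixel-wise maximum
    # so that sounding instances stay bright and silent ones are suppressed.
    attended = weights[:, None, None] * inst_masks
    return attended.max(axis=0), weights


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a soft mask and a binary mask."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-(target * np.log(pred)
                   + (1.0 - target) * np.log(1.0 - pred)).mean())


# Toy scene: a guitar (class 0) on the left, a dog (class 1) on the right.
masks = np.array([
    [[1, 1, 0, 0], [1, 1, 0, 0]],   # candidate 0: guitar
    [[0, 0, 1, 1], [0, 0, 1, 1]],   # candidate 1: dog
], dtype=float)
class_probs = np.array([[1.0, 0.0], [0.0, 1.0]])
audio = np.array([0.9, 0.05])       # the audio sounds like a guitar

merged, w = attend_audio_to_instances(masks, class_probs, audio)
loss = bce(merged, masks[0])        # ground truth: only the guitar sounds
```

In this toy scene the guitar candidate keeps a high weight and the dog candidate is almost removed from the merged mask, even though both objects are equally visible. Minimizing the loss against the ground-truth mask would, under these assumptions, push the audio scores and the instance classes to agree on which object is sounding.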
