In this paper, we focus on Audio-Visual Segmentation (AVS), a recently proposed task that requires establishing fine-grained correspondence between an audio stream and image pixels. Learning such correspondence faces two key challenges: (1) audio signals are inherently information-dense, because sounds produced by multiple objects are entangled within the same audio stream; (2) audio signals from objects of the same category tend to have similar frequencies, which makes the target object hard to distinguish and leads to ambiguous segmentation results. To this end, we propose an Audio Unmixing and Semantic Segmentation Network (AUSS), which encourages unmixing complicated audio signals and distinguishing similar sounds. Technically, AUSS unmixes the audio signal into a set of audio queries, which interact with visual features through masked attention mechanisms. To encourage these audio queries to capture distinctive features embedded within the audio, two self-supervised losses are introduced as additional supervision at both the class and mask levels. Extensive experimental results on the AVSBench benchmark show that AUSS sets a new state of the art on both the single-source and multi-source subsets, demonstrating its effectiveness in bridging the gap between the audio and vision modalities.
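As a rough illustration of how a set of audio queries might interact with visual features through masked attention, the following is a minimal NumPy sketch. It is not the AUSS implementation: the abstract does not specify the attention formulation, so the sketch assumes a single-head, Mask2Former-style masked cross-attention in which each query attends only to pixels inside its own previously predicted mask. The function name `masked_cross_attention`, the tensor shapes, and the zero-logit threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def masked_cross_attention(queries, pixel_feats, mask_logits):
    """Single-head masked cross-attention (illustrative assumption, not AUSS itself).

    queries:     (Q, D) audio queries
    pixel_feats: (N, D) flattened visual features, N = H * W
    mask_logits: (Q, N) per-query mask logits from a previous layer
    Returns the updated queries (Q, D) and the attention weights (Q, N).
    """
    d = queries.shape[-1]
    scores = queries @ pixel_feats.T / np.sqrt(d)        # (Q, N) similarity scores

    # Assumed rule: a pixel is foreground for a query when its logit is positive.
    keep = mask_logits > 0
    # A query whose mask is empty falls back to attending over all pixels.
    empty = ~keep.any(axis=1, keepdims=True)
    keep = keep | empty

    # Exclude masked-out pixels from the softmax.
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Residual update of each query with its attended visual features.
    return queries + weights @ pixel_feats, weights
```

In this sketch each row of `weights` sums to one and is zero outside that query's mask, so a query representing one sound source aggregates visual evidence only from the region it currently claims.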