Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame and represent it with a pixel-wise segmentation mask. The pioneering work addresses this task through dense feature-level audio-visual interaction, which ignores the dimension gap between the two modalities. More specifically, the audio clip provides only a global semantic label for each sequence, whereas the video frame contains multiple semantic objects across different local regions, which leads to mislocalization of objects that are representationally similar but semantically different. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) that aligns audio-visual semantics at the global level and progressively injects them into local regions via an attention mechanism. First, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence with the similarities of modality-specific label embeddings. Then, we feed the unified-modal label back to the visual backbone as explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the object of interest. Extensive experiments on the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3) settings of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance. Code is available at https://github.com/ZhaofengSHI/AVS-C3N.
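To make the consensus inference concrete, the following is a minimal PyTorch sketch of how a unified-modal label could be selected by combining audio/visual classification confidences with the similarity of modality-specific label embeddings. The function name, tensor shapes, and the product-based scoring rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consensus_label(p_audio, p_visual, emb_audio, emb_visual):
    """Sketch of cross-modal consensus inference (assumed formulation).

    p_audio:    (Ca,)   audio classification confidences
    p_visual:   (Cv,)   visual classification confidences
    emb_audio:  (Ca, D) embeddings of audio class names
    emb_visual: (Cv, D) embeddings of visual class names
    Returns the (audio, visual) index pair of the consensus label.
    """
    # Pairwise cosine similarity between modality-specific label embeddings.
    sim = F.cosine_similarity(
        emb_audio.unsqueeze(1), emb_visual.unsqueeze(0), dim=-1
    )  # (Ca, Cv)
    # Weight semantic similarity by the joint classification confidence
    # (illustrative choice; the paper may combine these terms differently).
    score = p_audio.unsqueeze(1) * p_visual.unsqueeze(0) * sim
    ia, iv = divmod(score.argmax().item(), score.size(1))
    return ia, iv
```

Under this scoring rule, a label pair is selected only when both classifiers are confident and the two class names are semantically close, which is one plausible reading of "cognitive consensus."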
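Similarly, a hedged sketch of how the consensus label might be fed back to the visual backbone as semantic guidance: here the label embedding gates feature channels via a learned projection and sigmoid attention. The module name, the single linear projection, and the channel-wise gating are assumptions for illustration, not the paper's CCAM design.

```python
import torch
import torch.nn as nn

class CognitiveGuidedAttention(nn.Module):
    """Hypothetical sketch: project the consensus-label embedding into a
    channel-attention vector that re-weights a visual feature map."""

    def __init__(self, emb_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(emb_dim, channels), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) backbone features; label_emb: (B, emb_dim)
        gate = self.proj(label_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # Highlight channels associated with the sounding object.
        return feat * gate

# Example usage with arbitrary shapes:
ccam = CognitiveGuidedAttention(emb_dim=300, channels=256)
out = ccam(torch.randn(2, 256, 56, 56), torch.randn(2, 300))
```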