Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, represented as a pixel-wise segmentation mask. The pioneering work performs this task through dense feature-level audio-visual interaction, which ignores the dimension gap between the modalities. More specifically, an audio clip can provide only a global semantic label for each sequence, whereas a video frame covers multiple semantic objects across different local regions; this mismatch leads to mislocalization of objects that are representationally similar but semantically different. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) that aligns the audio-visual semantics at the global dimension and progressively injects them into the local regions via an attention mechanism. First, a Cross-modal Cognitive Consensus Inference Module (C3IM) extracts a unified-modal label by integrating audio/visual classification confidences with the similarities of modality-specific label embeddings. Then, we feed the unified-modal label back to the visual backbone as explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the object of interest. Extensive experiments on the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3) settings of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance. Code is available at https://github.com/ZhaofengSHI/AVS-C3N.
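The two-stage idea described above (infer a consensus label globally, then use it to reweight local features) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `consensus_label` and `guided_attention`, the multiplicative scoring rule, the cosine similarity, and the residual attention form are all assumptions made for illustration, and the abstract does not specify how C3IM or CCAM actually combine these quantities.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def consensus_label(audio_logits, visual_logits, audio_emb, visual_emb):
    """Hypothetical consensus inference (assumed scoring rule).

    Scores every (audio label, visual label) pair by the product of the two
    classification confidences and the cosine similarity of their label
    embeddings, then returns the highest-scoring pair.
    """
    p_a = softmax(audio_logits)    # (Na,) audio class confidences
    p_v = softmax(visual_logits)   # (Nv,) visual class confidences
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = a @ v.T                  # (Na, Nv) cosine similarities
    score = np.outer(p_a, p_v) * sim
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return int(i), int(j), score


def guided_attention(feat, label_emb):
    """Hypothetical label-guided attention (assumed residual form).

    feat: (C, H, W) visual feature map; label_emb: (C,) label embedding.
    Spatial positions whose features align with the label are amplified.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    attn = softmax(label_emb @ flat / np.sqrt(c))   # (H*W,), sums to 1
    out = feat * (1.0 + attn.reshape(1, h, w))      # residual reweighting
    return out, attn


# Toy inputs: 2 audio labels, 3 visual labels, 2-d label embeddings.
audio_logits = np.array([3.0, 0.0])
visual_logits = np.array([1.0, 2.0, 0.0])
audio_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
visual_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

i, j, score = consensus_label(audio_logits, visual_logits, audio_emb, visual_emb)

# Use the selected visual label embedding to guide a random feature map.
feat = np.random.default_rng(0).normal(size=(2, 4, 4))
guided, attn = guided_attention(feat, visual_emb[j])
```

In this toy setup the visual classifier alone would favour visual label 1, but the combined score selects the pair (0, 0), because visual label 0 is the one whose embedding agrees with the confident audio label. That is the kind of cross-modal disagreement the consensus step is meant to resolve.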