Cross-modal Cognitive Consensus guided Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, represented as a pixel-wise segmentation mask. The pioneering work performs this task through dense feature-level audio-visual interaction, which ignores the dimension gap between the two modalities. More specifically, the audio clip can only provide a global semantic label for each sequence, whereas the video frame covers multiple semantic objects across different local regions. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) that aligns the audio-visual semantics at the global level and progressively injects them into the local regions via an attention mechanism. First, a Cross-modal Cognitive Consensus Inference Module (C3IM) extracts a unified-modal label by integrating the audio and visual classification confidences with the similarities of modality-specific label embeddings. Then, the unified-modal label is fed back to the visual backbone as explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the object of interest. Extensive experiments on the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3) settings of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
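To make the two-stage idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the abstract does not give the scoring formula, so the product rule (audio confidence × visual confidence × cosine similarity of the label embeddings) and the softmax attention over region features are assumptions chosen for illustration, and the function names `consensus_label` and `attention_weights` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consensus_label(audio_conf, visual_conf, audio_emb, visual_emb):
    """Pick the (audio label, visual label) pair on which both modalities agree most.

    ASSUMPTION: the pair score is audio confidence * visual confidence *
    cosine similarity of the two label embeddings. The abstract only says
    these quantities are "integrated"; the exact rule is not stated there.
    Returns (audio_index, visual_index, score).
    """
    best = (None, None, float("-inf"))
    for i, pa in enumerate(audio_conf):
        for j, pv in enumerate(visual_conf):
            score = pa * pv * cosine(audio_emb[i], visual_emb[j])
            if score > best[2]:
                best = (i, j, score)
    return best


def attention_weights(label_vec, local_feats):
    """Softmax over the similarity between a label embedding and each local region feature.

    ASSUMPTION: a stand-in for label-guided attention; regions whose features
    resemble the unified label receive larger weights.
    """
    sims = [cosine(label_vec, f) for f in local_feats]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy data: 2 audio labels (dog bark, guitar), 3 visual labels (dog, guitar, person).
    audio_conf = [0.7, 0.3]
    visual_conf = [0.3, 0.2, 0.5]          # the visual classifier alone prefers "person"
    audio_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    visual_emb = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

    i, j, score = consensus_label(audio_conf, visual_conf, audio_emb, visual_emb)
    print("consensus pair:", i, j)

    # Three local region features; the first one resembles the chosen visual label.
    regions = [[0.8, 0.2, 0.0], [0.0, 0.0, 1.0], [0.1, 0.9, 0.0]]
    print("attention:", attention_weights(visual_emb[j], regions))
```

In this toy input the visual classifier on its own ranks "person" highest, but the audio evidence and the embedding similarity shift the consensus to the dog label, and the attention weights then favor the region whose feature resembles that label. This mirrors the global-to-local flow described above, under the stated assumptions.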
