A Closer Look at Audio-Visual Semantic Segmentation

Audio-visual segmentation (AVS) is a complex task that involves accurately segmenting the sounding object that corresponds to an audio-visual query. Successful audio-visual learning requires two essential components: 1) an unbiased dataset with high-quality pixel-level multi-class labels, and 2) a model capable of effectively linking audio information with its corresponding visual object. Current methods address these two requirements only partially: training sets contain biased audio-visual data, and models generalise poorly beyond this biased training set. In this work, we propose a new strategy to build cost-effective and relatively unbiased audio-visual semantic segmentation benchmarks. Our strategy, called Visual Post-production (VPO), exploits the observation that explicit audio-visual pairs extracted from single video sources are not necessary to build such benchmarks. We also refine the previously proposed AVSBench to transform it into the audio-visual semantic segmentation benchmark AVSBench-Single+. Furthermore, this paper introduces a new pixel-wise audio-visual contrastive learning method that enables the model to generalise better beyond the training set. We verify the validity of the VPO strategy by showing that state-of-the-art (SOTA) models reach almost the same accuracy whether they are trained on datasets built by matching audio and visual data from different sources or on datasets whose audio and visual data come from the same video source. Then, using the proposed VPO benchmarks and AVSBench-Single+, we show that our method produces more accurate audio-visual semantic segmentation than SOTA models. Code and dataset will be made available.
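
As an illustration of the idea behind VPO, the following is a minimal sketch of pairing visual and audio data drawn from different sources. The abstract does not specify how the matching is done; pairing by shared class label is an assumption made here, and the function name `build_vpo_pairs`, the sample identifiers, and the class names are all hypothetical.

```python
import random
from collections import defaultdict


def build_vpo_pairs(images, audios, seed=0):
    """Pair each labelled image with an audio clip of the same class.

    Assumption (not from the paper): matching is done on a shared class label.
    images: list of (image_id, class_label) from a segmentation dataset
    audios: list of (audio_id, class_label) from an unrelated audio dataset
    Returns a list of (image_id, audio_id, class_label) triples.
    """
    rng = random.Random(seed)  # local RNG so the pairing is reproducible

    # Index the audio clips by class for quick lookup.
    audio_by_class = defaultdict(list)
    for audio_id, label in audios:
        audio_by_class[label].append(audio_id)

    pairs = []
    for image_id, label in images:
        candidates = audio_by_class.get(label)
        if not candidates:
            continue  # no audio available for this class, so skip the image
        pairs.append((image_id, rng.choice(candidates), label))
    return pairs


# Hypothetical toy data: the images and audio clips come from different sources.
images = [("img_001", "dog"), ("img_002", "guitar"),
          ("img_003", "dog"), ("img_004", "violin")]
audios = [("aud_a", "dog"), ("aud_b", "dog"), ("aud_c", "guitar")]

pairs = build_vpo_pairs(images, audios)
for image_id, audio_id, label in pairs:
    print(image_id, audio_id, label)
```

In this sketch an image whose class has no audio clip is dropped rather than paired with a mismatched sound, so every resulting pair stays semantically consistent even though no image and clip come from the same video.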

