A Closer Look at Audio-Visual Semantic Segmentation

Audio-visual segmentation (AVS) is a complex task that involves accurately segmenting the sounding object that corresponds to an audio-visual query. Successful audio-visual learning requires two essential components: 1) an unbiased dataset with high-quality pixel-level multi-class labels, and 2) a model capable of effectively linking audio information with its corresponding visual object. Current methods address these two requirements only partially: their training sets contain biased audio-visual data, and their models generalise poorly beyond this biased training set. In this work, we propose a new strategy to build cost-effective and relatively unbiased audio-visual semantic segmentation benchmarks. Our strategy, called Visual Post-production (VPO), exploits the observation that explicit audio-visual pairs extracted from single video sources are not necessary to build such benchmarks. We also refine the previously proposed AVSBench to transform it into the audio-visual semantic segmentation benchmark AVSBench-Single+. Furthermore, this paper introduces a new pixel-wise audio-visual contrastive learning method that enables the model to generalise better beyond the training set. We verify the validity of the VPO strategy by showing that state-of-the-art (SOTA) models reach almost the same accuracy whether they are trained on datasets built by matching audio and visual data from different sources or on datasets whose audio and visual data come from the same video source. Then, using the proposed VPO benchmarks and AVSBench-Single+, we show that our method produces more accurate audio-visual semantic segmentation than SOTA models. Code and datasets will be made available.
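
The VPO observation, that audio and visual data need not come from the same video, can be illustrated with a minimal sketch. It assumes that cross-source pairs are formed by matching semantic class labels, which the abstract does not state. The function name `build_vpo_pairs`, the tuple layout, and the sample identifiers are hypothetical and are not taken from the authors' code.

```python
import random
from collections import defaultdict


def build_vpo_pairs(images, audio_clips, seed=0):
    """Pair each labelled image with an audio clip of the same class.

    Hypothetical sketch: matching by class label is an assumption,
    not the authors' documented procedure.

    images:      list of (image_id, class_label) tuples
    audio_clips: list of (audio_id, class_label) tuples, possibly
                 drawn from an unrelated source
    Returns a list of (image_id, audio_id, class_label) tuples.
    """
    rng = random.Random(seed)  # seeded for reproducible pairing

    # Index the audio pool by class label.
    audio_by_class = defaultdict(list)
    for audio_id, label in audio_clips:
        audio_by_class[label].append(audio_id)

    pairs = []
    for image_id, label in images:
        candidates = audio_by_class.get(label)
        if not candidates:
            continue  # no audio available for this class: skip the image
        pairs.append((image_id, rng.choice(candidates), label))
    return pairs


# Illustrative data only; these identifiers are made up.
images = [("img_001", "dog"), ("img_002", "guitar"), ("img_003", "zebra")]
audio_clips = [("aud_a", "dog"), ("aud_b", "dog"), ("aud_c", "guitar")]
pairs = build_vpo_pairs(images, audio_clips)
```

Here `img_003` is dropped because the audio pool contains no clip of its class. A real pipeline would also need pixel-level masks for each image, which this sketch leaves out.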
