Audio-visual segmentation (AVS) is a challenging task that requires accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning depends critically on accurate cross-modal alignment between sounds and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset of images with high-quality, pixel-level, multi-class annotations, each associated with an audio file, and 2) a model that can establish strong links between audio information and the corresponding visual object. However, current methods only partially address these requirements: their training sets contain biased audio-visual data, and their models generalise poorly beyond those biased training sets. In this work, we propose a new cost-effective strategy for building challenging, relatively unbiased, high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning, which leverages discriminative contrastive samples to enforce cross-modal understanding. We present empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.