Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset of high-quality, pixel-level, multi-class annotated images paired with audio files; and 2) a model that can establish strong links between audio information and its corresponding visual object. However, current methods only partially meet these requirements: their training sets contain biased audio-visual data, and their models generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning that leverages discriminative contrastive samples to enforce cross-modal understanding. Empirical results demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.