The objective of Audio-Visual Segmentation (AVS) is to locate sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks. In this paper, we make the following contributions: (i) we propose a scalable, annotation-free pipeline for generating synthetic data for the AVS task. We leverage existing image segmentation and audio datasets, linking category labels, image-mask pairs, and audio samples, which allows us to easily compose (image, audio, mask) triplets for training AVS models; (ii) we introduce a novel Audio-Aware Transformer (AuTR) architecture that features an audio-aware query-based transformer decoder. This architecture enables the model to search for sounding objects under the guidance of audio signals, resulting in more accurate segmentation; (iii) we present extensive experiments on both synthetic and real datasets, which demonstrate the effectiveness of training AVS models with synthetic data generated by our proposed pipeline. Additionally, our proposed AuTR architecture exhibits superior performance and strong generalization ability on public benchmarks. The project page is https://jinxiang-liu.github.io/anno-free-AVS/.
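The triplet composition described in contribution (i) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each image-mask pair and each audio clip carries a category label, and that a pair is matched with a randomly chosen clip of the same category. The names `Triplet` and `compose_triplets`, the tuple layouts, and the file names are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One synthetic training sample (hypothetical container)."""
    image: str     # path or ID of the image
    audio: str     # path or ID of the audio clip
    mask: str      # path or ID of the segmentation mask
    category: str  # category label shared by the image-mask pair and the audio


def compose_triplets(image_mask_pairs, audio_clips, seed=0):
    """Pair each (image, mask, category) with an audio clip of the same category.

    image_mask_pairs: iterable of (image, mask, category) tuples
    audio_clips:      iterable of (audio, category) tuples
    Both layouts are assumptions made for this sketch.
    """
    rng = random.Random(seed)

    # Index the audio clips by category label.
    audio_by_category = defaultdict(list)
    for audio, category in audio_clips:
        audio_by_category[category].append(audio)

    triplets = []
    for image, mask, category in image_mask_pairs:
        candidates = audio_by_category.get(category)
        if not candidates:
            # No audio clip shares this category, so the pair is skipped.
            continue
        triplets.append(Triplet(image, rng.choice(candidates), mask, category))
    return triplets


if __name__ == "__main__":
    pairs = [
        ("img_dog.jpg", "mask_dog.png", "dog"),
        ("img_cat.jpg", "mask_cat.png", "cat"),
        ("img_sofa.jpg", "mask_sofa.png", "sofa"),  # silent category, no audio
    ]
    clips = [("bark1.wav", "dog"), ("bark2.wav", "dog"), ("meow.wav", "cat")]
    for triplet in compose_triplets(pairs, clips):
        print(triplet)
```

Because the matching relies only on category labels, no manual audio-visual annotation is needed in this sketch; categories with no corresponding audio are simply dropped.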