Audio-Visual Segmentation (AVS) aims to locate sounding objects in visual scenes by predicting accurate pixel-wise segmentation masks. In this paper, we make the following contributions: (i) we propose a scalable, annotation-free pipeline for generating synthetic data for the AVS task. We leverage existing image segmentation and audio datasets, linking category labels, image-mask pairs, and audio samples, which allows us to easily compose (image, audio, mask) triplets for training AVS models; (ii) we introduce a novel Audio-Aware Transformer (AuTR) architecture that features an audio-aware, query-based transformer decoder. This architecture enables the model to search for sounding objects under the guidance of audio signals, resulting in more accurate segmentation; (iii) we present extensive experiments on both synthetic and real datasets, which demonstrate the effectiveness of training AVS models with synthetic data generated by our pipeline. Additionally, AuTR exhibits superior performance and strong generalization ability on public benchmarks. The project page is https://jinxiang-liu.github.io/anno-free-AVS/.
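The triplet composition described in contribution (i) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that linking works by exact match on a shared category label and that one audio clip is sampled at random per image-mask pair. The function name `compose_triplets`, the tuple layouts, and the file names are all hypothetical.

```python
import random
from collections import defaultdict


def compose_triplets(image_mask_pairs, audio_clips, per_image=1, seed=0):
    """Pair each (image, mask) with audio clips of the same category.

    image_mask_pairs: iterable of (image_id, mask_id, category)
    audio_clips:      iterable of (audio_id, category)
    Returns a list of (image_id, audio_id, mask_id) triplets.
    Assumption: categories are linked by exact label match.
    """
    rng = random.Random(seed)

    # Index the audio clips by category label for fast lookup.
    audio_by_category = defaultdict(list)
    for audio_id, category in audio_clips:
        audio_by_category[category].append(audio_id)

    triplets = []
    for image_id, mask_id, category in image_mask_pairs:
        candidates = audio_by_category.get(category, [])
        if not candidates:
            # No audio shares this label, so the pair is skipped.
            continue
        for _ in range(per_image):
            triplets.append((image_id, rng.choice(candidates), mask_id))
    return triplets


# Hypothetical toy inputs: "bus" has no matching audio clip.
pairs = [
    ("img_dog.jpg", "mask_dog.png", "dog"),
    ("img_cat.jpg", "mask_cat.png", "cat"),
    ("img_bus.jpg", "mask_bus.png", "bus"),
]
clips = [
    ("bark_01.wav", "dog"),
    ("bark_02.wav", "dog"),
    ("meow_01.wav", "cat"),
]
triplets = compose_triplets(pairs, clips)
```

Because no manual audio-visual annotation enters this loop, the number of triplets scales with the size of the two source datasets, which is the sense in which such a pipeline is annotation-free.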