The objective of Audio-Visual Segmentation (AVS) is to localise the sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks. Tackling the task requires careful consideration of both the data and the model. In this paper, we first propose a novel pipeline for generating artificial data for the AVS task without human annotation. We leverage existing image segmentation and audio datasets, matching each image-mask pair with corresponding audio samples through their shared category labels, which allows us to effortlessly compose (image, audio, mask) triplets for training AVS models. The pipeline is annotation-free and scales to a large number of categories. Additionally, we introduce a lightweight approach, SAMA-AVS, that adapts the pre-trained Segment Anything Model (SAM) to the AVS task. By introducing only a small number of trainable parameters through adapters, the proposed model achieves effective audio-visual fusion and interaction in the encoding stage while the vast majority of parameters remain fixed. We conduct extensive experiments, and the results show that our proposed model substantially outperforms competing methods. Moreover, when the proposed model is pretrained on our synthetic data, its performance on real AVSBench data improves further, reaching 83.17 mIoU on the S4 subset and 66.95 mIoU on the MS3 subset.
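The label-based matching behind the triplet composition can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compose_triplets`, the tuple layouts, the file names, and the choice to sample one audio clip uniformly at random per image are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def compose_triplets(image_mask_pairs, audio_clips, seed=0):
    """Pair each (image, mask) with an audio clip of the same category.

    image_mask_pairs: list of (image_id, mask_id, category) tuples (assumed layout).
    audio_clips:      list of (audio_id, category) tuples (assumed layout).
    Returns a list of (image_id, audio_id, mask_id) triplets.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Index the audio clips by their category label.
    audio_by_category = defaultdict(list)
    for audio_id, category in audio_clips:
        audio_by_category[category].append(audio_id)

    triplets = []
    for image_id, mask_id, category in image_mask_pairs:
        candidates = audio_by_category.get(category)
        if not candidates:
            # No audio shares this label, so no triplet can be formed.
            continue
        # Assumption: one uniformly sampled clip per image-mask pair.
        triplets.append((image_id, rng.choice(candidates), mask_id))
    return triplets


# Hypothetical toy data; the identifiers are placeholders, not real dataset files.
pairs = [
    ("img_001.jpg", "mask_001.png", "dog"),
    ("img_002.jpg", "mask_002.png", "violin"),
    ("img_003.jpg", "mask_003.png", "zebra"),
]
clips = [
    ("dog_bark_01.wav", "dog"),
    ("dog_bark_02.wav", "dog"),
    ("violin_01.wav", "violin"),
]
triplets = compose_triplets(pairs, clips)
```

In this toy run the "zebra" pair is dropped because no audio clip carries that label, which reflects the fact that only categories present in both the segmentation and the audio source can contribute triplets.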