The objective of Audio-Visual Segmentation (AVS) is to localise sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks. Tackling the task requires careful consideration of both the data and the model. In this paper, we first propose a novel pipeline for generating synthetic data for the AVS task without extra manual annotation. We leverage existing image segmentation and audio datasets, and match image-mask pairs with their corresponding audio samples using the category labels in the segmentation datasets, which allows us to effortlessly compose (image, audio, mask) triplets for training AVS models. The pipeline is annotation-free and scales to a large number of categories. Additionally, we introduce a lightweight model, SAMA-AVS, which adapts the pre-trained Segment Anything Model (SAM) to the AVS task. By introducing only a small number of trainable parameters through adapters, the proposed model achieves effective audio-visual fusion and interaction in the encoding stage while the vast majority of parameters remain fixed. We conduct extensive experiments, and the results show that our proposed model substantially outperforms competing methods. Moreover, pretraining the proposed model on our synthetic data further improves performance on the real AVSBench data, achieving 83.17 mIoU on the S4 subset and 66.95 mIoU on the MS3 subset. The project page is https://jinxiang-liu.github.io/anno-free-AVS/.
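As a rough illustration of the triplet-composition idea described above, the following minimal sketch pairs each image-mask sample with an audio clip that shares its category label. It is not the authors' implementation: the function name `compose_triplets`, the tuple layouts, the file names, and the choice to sample one audio clip uniformly at random per image are all assumptions made for illustration. The abstract only states that matching is done through category labels.

```python
import random
from collections import defaultdict


def compose_triplets(image_mask_pairs, audio_clips, seed=0):
    """Compose (image, audio, mask) triplets by matching category labels.

    image_mask_pairs: list of (image_id, mask_id, category)  # hypothetical layout
    audio_clips:      list of (audio_id, category)           # hypothetical layout
    """
    rng = random.Random(seed)

    # Index the audio clips by category so lookup is O(1) per image.
    audio_by_category = defaultdict(list)
    for audio_id, category in audio_clips:
        audio_by_category[category].append(audio_id)

    triplets = []
    for image_id, mask_id, category in image_mask_pairs:
        candidates = audio_by_category.get(category)
        if not candidates:
            # No audio exists for this category, so the sample is skipped.
            continue
        # Assumption: one clip is drawn uniformly at random per image.
        triplets.append((image_id, rng.choice(candidates), mask_id))
    return triplets


# Toy data with made-up file names.
pairs = [
    ("img_001.jpg", "mask_001.png", "dog"),
    ("img_002.jpg", "mask_002.png", "guitar"),
    ("img_003.jpg", "mask_003.png", "zebra"),
]
clips = [
    ("dog_bark_01.wav", "dog"),
    ("dog_bark_02.wav", "dog"),
    ("guitar_01.wav", "guitar"),
]
triplets = compose_triplets(pairs, clips)
```

In this toy run the "zebra" image is dropped because no audio clip carries that label. Since no step requires a human to inspect an image or listen to a clip, a procedure of this kind needs no extra annotation, and its category coverage is limited only by the overlap between the label sets of the two source datasets.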