Audio-Visual Segmentation (AVS) aims to identify and localize sounding elements in visual scenes by predicting segmentation masks at the pixel level. Addressing this task effectively requires attention to both the data and the model. This study presents a lightweight approach, SAVE, which efficiently adapts the pre-trained Segment Anything Model (SAM) to the AVS task. By incorporating an image encoder adapter into the transformer blocks to better capture dataset-specific information, and by proposing a residual audio encoder adapter that encodes audio features as a sparse prompt, our model achieves effective audio-visual fusion and interaction during the encoding stage. Our method accelerates training and inference by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state of the art (SOTA). Extensive experiments validate our approach, demonstrating that it significantly outperforms other SOTA methods. Moreover, leveraging a model pre-trained on synthetic data enhances performance on real AVSBench data, achieving 84.59 mIoU on the S4 (V1S) subset and 70.28 mIoU on the MS3 (V1M) subset with 256-pixel inputs; with 1024-pixel inputs, this rises to 86.16 mIoU on S4 (V1S) and 70.83 mIoU on MS3 (V1M).
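To make the two adapter components concrete, below is a minimal PyTorch sketch, not the authors' implementation: a bottleneck adapter wrapped around frozen transformer blocks, and a residual audio adapter that maps an audio embedding to sparse prompt tokens. The bottleneck width, the 128-d audio feature size (as produced by VGGish-style encoders), the single prompt token, and the exact wiring are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    Bottleneck width is an assumption, not taken from the paper."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen transformer block of the image encoder with a trainable
    adapter (hypothetical wiring; only the adapter receives gradients)."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the pre-trained SAM weights frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


class ResidualAudioAdapter(nn.Module):
    """Projects a per-frame audio embedding (128-d assumed) to sparse prompt
    tokens matching SAM's prompt dimension, with a residual refinement path."""
    def __init__(self, audio_dim: int = 128, prompt_dim: int = 256, n_tokens: int = 1):
        super().__init__()
        self.n_tokens, self.prompt_dim = n_tokens, prompt_dim
        self.proj = nn.Linear(audio_dim, prompt_dim * n_tokens)
        self.refine = nn.Sequential(
            nn.Linear(prompt_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:  # audio: (B, audio_dim)
        tokens = self.proj(audio).view(-1, self.n_tokens, self.prompt_dim)
        return tokens + self.refine(tokens)  # residual connection


# Usage: audio features become sparse prompt tokens for a SAM-style mask decoder.
prompts = ResidualAudioAdapter()(torch.randn(2, 128))  # -> shape (2, 1, 256)
```

In this sketch the residual paths let both adapters start near identity, so fine-tuning perturbs rather than replaces the frozen pre-trained representations.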