Audio-Visual Segmentation (AVS) aims to identify and localize sounding elements in visual scenes by predicting segmentation masks at the pixel level. Addressing this task effectively requires attention to both the data and the model. This study presents a lightweight approach, SAVE, which efficiently adapts the pre-trained Segment Anything Model (SAM) to the AVS task. By incorporating an image encoder adapter into the transformer blocks to better capture dataset-specific information, and by proposing a residual audio encoder adapter that encodes audio features as a sparse prompt, our model achieves effective audio-visual fusion and interaction during the encoding stage. Our method accelerates training and inference by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state of the art (SOTA). Extensive experiments validate our approach, demonstrating that it significantly outperforms other SOTA methods. Moreover, leveraging a model pre-trained on synthetic data enhances performance on real AVSBench data, achieving 84.59 mIoU on the S4 (V1S) subset and 70.28 mIoU on the MS3 (V1M) subset with 256-pixel inputs; with 1024-pixel inputs, this rises to 86.16 mIoU on S4 (V1S) and 70.83 mIoU on MS3 (V1M).
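To make the two adapter components concrete, below is a minimal PyTorch sketch, not the authors' implementation: a bottleneck adapter wrapped around frozen transformer blocks, and a residual audio adapter that maps an audio embedding to sparse prompt tokens. The bottleneck width, the 128-d audio feature size (as produced by VGGish-style encoders), the single prompt token, and the exact wiring are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    Bottleneck width is an assumption, not taken from the paper."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen transformer block of the image encoder with a trainable
    adapter (hypothetical wiring; only the adapter receives gradients)."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the pre-trained SAM weights frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


class ResidualAudioAdapter(nn.Module):
    """Projects a per-frame audio embedding (128-d assumed) to sparse prompt
    tokens matching SAM's prompt dimension, with a residual refinement path."""
    def __init__(self, audio_dim: int = 128, prompt_dim: int = 256, n_tokens: int = 1):
        super().__init__()
        self.n_tokens, self.prompt_dim = n_tokens, prompt_dim
        self.proj = nn.Linear(audio_dim, prompt_dim * n_tokens)
        self.refine = nn.Sequential(
            nn.Linear(prompt_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:  # audio: (B, audio_dim)
        tokens = self.proj(audio).view(-1, self.n_tokens, self.prompt_dim)
        return tokens + self.refine(tokens)  # residual connection


# Usage: audio features become sparse prompt tokens for a SAM-style mask decoder.
prompts = ResidualAudioAdapter()(torch.randn(2, 128))  # -> shape (2, 1, 256)
```

In this sketch the residual paths let both adapters start near identity, so fine-tuning perturbs rather than replaces the frozen pre-trained representations.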