The Segment Anything Model (SAM) has recently proven highly effective in visual segmentation tasks. However, little work has explored how SAM performs on audio-visual tasks such as visual sound localization and segmentation. In this work, we propose AV-SAM, a simple yet effective audio-visual localization and segmentation framework built on SAM that generates sounding object masks corresponding to the audio. Specifically, AV-SAM applies pixel-wise audio-visual fusion between audio features and visual features from SAM's pre-trained image encoder to aggregate cross-modal representations. The aggregated cross-modal features are then fed into the prompt encoder and mask decoder to generate the final audio-visual segmentation masks. We conduct extensive experiments on the Flickr-SoundNet and AVSBench datasets. The results demonstrate that the proposed AV-SAM achieves competitive performance on sounding object localization and segmentation.
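To make the pixel-wise fusion step concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not state the fusion operator, so the additive fusion here is an assumption, as are the function name `pixelwise_fusion`, the feature shapes, and the cosine-similarity map included to illustrate localization. In the described pipeline, the fused features would then go to SAM's prompt encoder and mask decoder, which are not shown.

```python
import numpy as np


def pixelwise_fusion(visual, audio, eps=1e-8):
    """Fuse one audio embedding with a visual feature map at every pixel.

    visual: array of shape (C, H, W), e.g. image-encoder features (assumed shape).
    audio:  array of shape (C,), a global audio embedding (assumed shape).
    Returns (fused, similarity) with shapes (C, H, W) and (H, W).
    """
    if visual.ndim != 3 or audio.ndim != 1:
        raise ValueError("expected visual (C, H, W) and audio (C,)")
    if visual.shape[0] != audio.shape[0]:
        raise ValueError("channel dimensions of visual and audio must match")

    # Assumption: additive fusion. The audio vector is broadcast over all
    # H x W spatial positions and added to the visual feature at each pixel.
    fused = visual + audio[:, None, None]

    # Illustrative only: per-pixel cosine similarity between the audio
    # embedding and each visual feature, a common way to visualize where
    # the sounding object might be.
    v_unit = visual / (np.linalg.norm(visual, axis=0, keepdims=True) + eps)
    a_unit = audio / (np.linalg.norm(audio) + eps)
    similarity = np.einsum("chw,c->hw", v_unit, a_unit)

    return fused, similarity


# Toy example with hypothetical sizes: 4 channels on a 2 x 3 feature map.
visual = np.ones((4, 2, 3))
audio = np.arange(4.0)
fused, similarity = pixelwise_fusion(visual, audio)
print(fused.shape, similarity.shape)
```

Broadcasting one audio vector over the spatial grid keeps the fused tensor the same shape as the image features, which is why a decoder that expects image-shaped embeddings could consume it without architectural changes.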