Audio-visual segmentation (AVS) aims to segment sounding objects in videos by predicting pixel-level masks conditioned on the audio signal. Existing methods primarily target closed-set scenarios and rely on direct audio-visual alignment and fusion, which limits their ability to generalize to unseen situations. In this paper, we propose OpenAVS, a novel training-free, language-based approach that, for the first time, effectively aligns the audio and visual modalities for open-vocabulary AVS by using text as a proxy. Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. OpenAVS is designed as a simple yet flexible architecture that relies on the most appropriate foundation models and fully leverages their capabilities, enabling more effective knowledge transfer to the downstream AVS task. Moreover, we present OpenAVS-ST, a model-agnostic framework that integrates OpenAVS with any advanced supervised AVS model via pseudo-label-based self-training. This framework improves performance by exploiting large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS: it surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.
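To make the three-stage design concrete, the following is a minimal sketch of how audio-to-text prompt generation, prompt translation, and text-to-visual segmentation might be chained with text as the only interface between stages. It is not the authors' implementation: the class `TextProxyAVS`, the three `toy_*` functions, and the label-grid "frame" are hypothetical stand-ins chosen so the example runs without any foundation model. In practice each callable would presumably wrap an audio-language model, an LLM, and a text-promptable segmenter, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[List[str]]   # toy frame: each "pixel" holds an object label (assumption)
Mask = List[List[int]]    # binary mask: 1 = sounding object, 0 = background


@dataclass
class TextProxyAVS:
    """Chains the three stages; each stage is a swappable callable (hypothetical)."""
    audio_to_text: Callable[[Sequence[float]], str]
    translate_prompt: Callable[[str], List[str]]
    segment_by_text: Callable[[Frame, List[str]], Mask]

    def __call__(self, audio: Sequence[float], frame: Frame) -> Mask:
        caption = self.audio_to_text(audio)          # 1) audio -> text prompt
        prompts = self.translate_prompt(caption)     # 2) prompt translation
        return self.segment_by_text(frame, prompts)  # 3) text -> mask


def toy_audio_to_text(audio: Sequence[float]) -> str:
    # Stand-in for an audio-language model: a loud clip is described as a dog.
    energy = sum(x * x for x in audio) / max(len(audio), 1)
    return "a dog is barking" if energy > 0.1 else "silence"


def toy_translate_prompt(caption: str) -> List[str]:
    # Stand-in for the LLM step: keep only words that name visible objects.
    vocabulary = {"dog", "cat", "guitar"}
    return [word for word in caption.split() if word in vocabulary]


def toy_segment_by_text(frame: Frame, prompts: List[str]) -> Mask:
    # Stand-in for a text-promptable segmenter: mark pixels whose label matches.
    return [[1 if label in prompts else 0 for label in row] for row in frame]


pipeline = TextProxyAVS(toy_audio_to_text, toy_translate_prompt, toy_segment_by_text)
frame = [["sky", "sky", "sky"],
         ["dog", "dog", "grass"]]
mask = pipeline([0.5, -0.6, 0.7], frame)
print(mask)  # [[0, 0, 0], [1, 1, 0]]
```

Because the stages communicate only through text, any one of them can be replaced by a stronger model without touching the other two, which is one plausible reading of the "simple yet flexible" claim. Under the same assumptions, the masks produced this way could serve as pseudo-labels for training a supervised model, as OpenAVS-ST describes.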