Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, we find that previous AVS methods rely heavily on detrimental segmentation preferences for audible objects rather than on precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, yielding representative sounding object features. These features not only encompass audio cues but also carry vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: \href{https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference}{https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference}
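
To make the idea of integrating audio features with text cues under a dynamic mask more concrete, the following toy sketch may help. It is not the module proposed in this work, whose details the abstract does not give. It assumes that audio and text cues are already embedded as plain vectors, that the mask keeps only cues whose cosine similarity to the audio exceeds a threshold, and that the kept cues are added to the audio feature with softmax weights. The function name, the threshold, and the fusion rule are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def fuse_audio_with_text_cues(audio, cues, threshold=0.5):
    """Hypothetical fusion: mask out weakly related text cues, then add the rest.

    audio: list of floats, one audio feature vector (assumed given).
    cues:  list of vectors, one embedding per candidate sounding object (assumed given).
    Returns (fused_feature, mask), where mask[i] is True if cue i was kept.
    """
    sims = [cosine(audio, cue) for cue in cues]
    # "Dynamic" here only means the mask depends on the current audio input.
    mask = [s >= threshold for s in sims]
    kept = [(s, cue) for s, cue, keep in zip(sims, cues, mask) if keep]
    if not kept:
        # No cue matches the audio: fall back to the unmodified audio feature.
        return list(audio), mask

    # Softmax over the similarities of the kept cues (shifted for stability).
    top = max(s for s, _ in kept)
    exps = [math.exp(s - top) for s, _ in kept]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Residual fusion: audio feature plus the weighted sum of kept cue embeddings.
    fused = list(audio)
    for w, (_, cue) in zip(weights, kept):
        fused = [f + w * c for f, c in zip(fused, cue)]
    return fused, mask
```

In this sketch, a cue that points away from the audio embedding is masked out and contributes nothing, so the fused feature is shaped only by text cues that agree with what is actually heard.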
