Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. These expressions are articulated in natural language but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described by corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. The dataset is available at \href{https://gewu-lab.github.io/Ref-AVS}{https://gewu-lab.github.io/Ref-AVS}.