In this paper, we propose a new multi-modal task, audio-visual instance segmentation (AVIS), whose goal is to simultaneously identify, segment, and track individual sounding object instances in audible videos. To our knowledge, this is the first time instance segmentation has been extended to the audio-visual domain. To facilitate this research, we construct the first audio-visual instance segmentation benchmark, AVISeg. Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6 seconds, collected from YouTube and public audio-visual datasets, of which 117 videos have been annotated using an interactive semi-automatic labeling tool based on the Segment Anything Model (SAM). In addition, we present a simple baseline model for the AVIS task, which adds an audio branch and a cross-modal fusion module to Mask2Former to locate all sounding objects. Finally, we evaluate the proposed method with two backbones on AVISeg. We believe that AVIS will inspire the community towards a more comprehensive multi-modal understanding.