In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment, and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes sound sources within each frame and condenses object-specific contexts into concise tokens. It then builds long-range audio-visual dependencies between these tokens using window-based attention, and tracks sounding objects across entire video sequences. Extensive experiments show that our method performs best on AVISeg, surpassing existing methods from related tasks. We further evaluate several multi-modal large models; however, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding. The dataset and code will be released at https://github.com/ruohaoguo/avis.
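To make the pipeline concrete, below is a minimal sketch of the window-based attention step described above: per-frame object contexts are assumed to be already condensed into tokens (with audio features fused upstream), and a windowed self-attention layer models dependencies across frames. This is not the authors' implementation; the module name `WindowedAVAttention`, all tensor shapes, and the window size are illustrative assumptions.

```python
# A minimal sketch (assumed shapes and names, not the authors' code) of
# window-based attention over condensed per-frame object tokens.
import torch
import torch.nn as nn


class WindowedAVAttention(nn.Module):
    """Self-attention over object tokens within a temporal window."""

    def __init__(self, dim: int = 256, num_heads: int = 8, window: int = 8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (T, N, D) -- T frames, N object tokens per frame, D channels
        T, N, D = tokens.shape
        pad = (-T) % self.window  # pad T up to a multiple of the window size
        x = torch.cat([tokens, tokens.new_zeros(pad, N, D)]) if pad else tokens
        # Group frames into non-overlapping windows: (num_windows, window*N, D),
        # so attention runs jointly over all object tokens inside each window.
        x = x.view(-1, self.window * N, D)
        out, _ = self.attn(x, x, x)   # attention within each temporal window
        out = self.norm(out + x)      # residual connection + layer norm
        return out.view(-1, N, D)[:T]  # restore (T, N, D) and drop padding


# Toy usage: 20 frames, 5 condensed object tokens per frame.
layer = WindowedAVAttention()
video_tokens = torch.randn(20, 5, 256)
print(layer(video_tokens).shape)  # torch.Size([20, 5, 256])
```

The non-overlapping windows keep the attention cost linear in video length, which is one plausible reason such a design suits the long videos in AVISeg; the actual model may differ in how windows are formed and how audio is injected.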