Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly related tasks that both aim to segment specific objects from video sequences according to user-provided expression prompts. However, because representations are difficult to model across modalities, contemporary methods struggle to balance interaction flexibility with high-precision localization and segmentation. In this paper, we address this problem from two perspectives: the aligned representation of audio and text, and the deep interaction among audio, text, and visual features. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer (EPCFormer). Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions. By introducing contrastive learning between audio and text expressions, EPCFormer learns the semantic equivalence between audio and text expressions that denote the same objects. Then, to facilitate deep interactions among audio, text, and video features, we introduce an Expression-Visual Attention (EVA) mechanism. By deeply exploring complementary cues between text and audio, knowledge of expression-prompted video object segmentation can transfer seamlessly between the two tasks. Experiments on well-recognized benchmarks demonstrate that our universal EPCFormer attains state-of-the-art results on both tasks. The source code of EPCFormer will be made publicly available at https://github.com/lab206/EPCFormer.
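To make the contrastive-learning idea behind Expression Alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired audio and text embeddings. The abstract does not specify the exact loss, so this is an illustrative assumption and not the EPCFormer implementation; the function names, the temperature value of 0.07, and the convention that `audio_emb[i]` and `text_emb[i]` describe the same object are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumed form, not the paper's).

    audio_emb[i] and text_emb[i] are assumed to refer to the same object;
    every other pairing in the batch is treated as a negative.
    """
    assert audio_emb and len(audio_emb) == len(text_emb)
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)

    # Temperature-scaled cosine similarities: sim[i][j] = <a_i, t_j> / tau.
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: each audio clip should pick its own text.
    a2t = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n

    # Text-to-audio direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2a = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)
```

Under these assumptions, a batch in which each audio embedding lies close to its matching text embedding yields a lower loss than the same batch with the text embeddings shuffled, which is the behaviour an alignment objective of this kind is meant to encourage.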