Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging because it requires exploring both the semantic alignment between modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. As the first unified framework for this task, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This endows the text or audio signals with temporal knowledge and strengthens the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, which yields better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets, with text and audio references respectively, MUTR improves J&F by +4.2% and +4.2% over state-of-the-art methods, demonstrating its significance for unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.
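The two temporal strategies described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the released MUTR code: it assumes that both steps can be approximated by plain single-head attention with a residual connection and no learned projections, and that object slot `n` refers to the same object in every frame. The names `cross_attend`, `aggregate_reference`, and `temporal_interaction` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Single-head attention with a residual connection.

    Assumption: no learned projections; keys and values are the same vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys_values])
        attended = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def aggregate_reference(reference_tokens, frames_by_scale):
    """Low-level aggregation (sketch): text or audio tokens attend to
    visual tokens pooled over consecutive frames, one scale at a time.

    frames_by_scale[s][t] is the list of visual tokens of frame t at scale s.
    """
    tokens = reference_tokens
    for scale_frames in frames_by_scale:
        visual_tokens = [tok for frame in scale_frames for tok in frame]
        tokens = cross_attend(tokens, visual_tokens)
    return tokens


def temporal_interaction(object_embeddings):
    """High-level interaction (sketch): each object slot exchanges
    information with the same slot in all other frames.

    object_embeddings[t][n] is the embedding of slot n in frame t.
    """
    num_frames = len(object_embeddings)
    num_slots = len(object_embeddings[0])
    updated = [[None] * num_slots for _ in range(num_frames)]
    for n in range(num_slots):
        track = [object_embeddings[t][n] for t in range(num_frames)]
        mixed = cross_attend(track, track)
        for t in range(num_frames):
            updated[t][n] = mixed[t]
    return updated
```

In this sketch, `aggregate_reference` would run on the reference tokens before they enter the transformer, and `temporal_interaction` would run on the per-frame object embeddings that the transformer produces. A real model would add learned query, key, and value projections, multiple heads, and normalization layers.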
