Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) requires segmenting the object in a video that is referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed \textit{Fully Transformer-Equipped Architecture} (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all objects in the video as candidate objects. Given a video clip with a text query, an encoder produces the visual-textual features, while the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object and whose feature map is decoded directly, in order, into a binary mask sequence. Finally, the model finds the best match between the mask sequences and the text query. In addition, to diversify the masks generated for the candidate objects, we impose a diversity loss on the model so that it captures a more accurate mask of the referred object. Empirical studies show the superiority of the proposed method on three benchmarks: FTEA achieves 45.1% and 38.7% in terms of mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% in terms of $\mathcal{J\&F}$ on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best competing method, it gains 2.1% and 3.2% in terms of P$@$0.5 on the former two, respectively, and 2.9% in terms of $\mathcal{J}$ on the latter.
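For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a mask IoU (the quantity that underlies the region similarity $\mathcal{J}$) and P$@$0.5 are commonly computed. It is not taken from the paper's code: the function names `mask_iou` and `precision_at` are hypothetical, masks are assumed to be nested lists of 0/1 values, and the strict "greater than" threshold and the handling of empty masks are assumptions that a given benchmark toolkit may define differently.

```python
from typing import Sequence

# Assumption: a binary mask is given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def precision_at(ious: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose IoU is strictly above the threshold."""
    if not ious:
        return 0.0
    return sum(1 for v in ious if v > threshold) / len(ious)
```

Under these assumptions, P$@$0.5 over a test set is `precision_at` applied to the per-sample IoU values, so a gain in P$@$0.5 means that a larger share of predicted masks overlap their ground truth by more than half.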
