Referring video object segmentation (RVOS) aims to segment an object in a video according to a human instruction. Current state-of-the-art methods follow an offline pattern, in which each clip interacts independently with the text embedding for cross-modal understanding. They usually hold that the offline pattern is necessary for RVOS, yet they model only limited temporal association within each clip. In this work, we challenge this belief and propose a simple yet effective online model that uses explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and a position prior, making referring predictions for the current frame both more accurate and easier. Furthermore, we generalize our online model into a semi-online framework so that it is compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F on Refer-Youtube-VOS and 64.8 J&F on Refer-DAVIS17, outperforming all offline methods.
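The idea of explicit query propagation can be illustrated with a toy sketch. This is not the OnlineRefer architecture: the functions `attend` and `run_online`, the `mix` parameter, and the use of a single softmax attention step in place of a transformer decoder are all assumptions made for illustration. The sketch only shows the online control flow, in which the query updated on one frame carries semantic and positional cues into the next frame.

```python
import math


def attend(query, features):
    """Toy cross-attention (hypothetical helper): softmax over dot-product
    scores, then a weighted average of the per-location feature vectors."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(query)
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


def run_online(frames, text_embedding, mix=0.5):
    """Process frames one at a time, carrying the target query forward.

    `frames` is a list of frames; each frame is a list of feature vectors,
    one per spatial location. `mix` (an assumed parameter) controls how much
    of the attended context is blended into the propagated query.
    """
    query = list(text_embedding)  # first frame: the query comes from the text alone
    predictions = []
    for features in frames:
        context, weights = attend(query, features)
        # Prediction for this frame: the location the query attends to most.
        predictions.append(max(range(len(weights)), key=weights.__getitem__))
        # Explicit propagation: the updated query seeds the next frame.
        query = [(1 - mix) * q + mix * c for q, c in zip(query, context)]
    return predictions, query


# Two frames with three locations each; the target moves from location 1 to 2.
frames = [
    [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0], [2.0, 0.0]],
]
predictions, final_query = run_online(frames, text_embedding=[1.0, 0.0])
```

In an offline design, each clip would instead be matched against the text embedding independently, so nothing learned about the target in one clip would reach the next.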