Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing. However, the recent success of online methods questions this belief, in particular for challenging and long video sequences. We view this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis of different processing paradigms and introduce NOVIS (Near-Online Video Instance Segmentation), a new end-to-end trainable method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS is the first near-online VIS approach that avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on both the YouTube-VIS (2019/2021) and OVIS benchmarks.
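To make the three processing paradigms concrete, the following minimal sketch shows how they can be seen as different ways of partitioning a video into clips. It is an illustration only, not the NOVIS implementation: the function name `clip_windows`, the parameters `clip_len` and `overlap`, and the assumption that consecutive clips share frames are all hypothetical, and the learned overlap embeddings used for tracking are not modeled here.

```python
from typing import List


def clip_windows(num_frames: int, clip_len: int, overlap: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into consecutive clips.

    Hypothetical helper for illustration: each clip holds up to `clip_len`
    frames and shares `overlap` frames with the previous clip.
    """
    if clip_len < 1 or not 0 <= overlap < clip_len:
        raise ValueError("need clip_len >= 1 and 0 <= overlap < clip_len")
    if num_frames <= 0:
        return []

    stride = clip_len - overlap  # how far each new clip advances
    clips: List[List[int]] = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)  # last clip may be shorter
        clips.append(list(range(start, end)))
        if end >= num_frames:
            break
        start += stride
    return clips


# Online: one frame per clip, no shared frames.
online = clip_windows(num_frames=7, clip_len=1, overlap=0)
# Offline: the whole video as a single clip.
offline = clip_windows(num_frames=7, clip_len=7, overlap=0)
# Near-online: short clips that share frames (values chosen arbitrarily).
near_online = clip_windows(num_frames=7, clip_len=3, overlap=1)
```

Under these assumptions, the online and offline settings are the two extremes of the same clip-based scheme, and the near-online setting sits in between; the shared frames are where instance identities could be matched from one clip to the next.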