Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing. However, the recent success of online methods calls this belief into question, particularly for long and challenging video sequences. We see this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis of different processing paradigms and the new end-to-end trainable NOVIS (Near-Online Video Instance Segmentation) method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS is the first near-online VIS approach that avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on the YouTube-VIS (2019/2021) and OVIS benchmarks.
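To make the near-online paradigm concrete, the following is a minimal sketch of clip-wise processing with instance association between consecutive clips. It is not the NOVIS implementation: the function names (`overlapping_clips`, `match_instances`), the cosine-similarity measure, the greedy assignment, and the `min_sim` threshold are all assumptions introduced for illustration. In particular, the greedy matching shown here is exactly the kind of handcrafted association step that NOVIS is stated to avoid by learning overlap embeddings end to end.

```python
import math


def overlapping_clips(num_frames, clip_len, stride):
    """Split a video into clips of `clip_len` frames that advance by `stride`.

    With stride < clip_len, consecutive clips share frames (the overlap).
    Hypothetical helper, not taken from the paper.
    """
    if not 0 < stride <= clip_len:
        raise ValueError("stride must be in (0, clip_len]")
    if num_frames <= 0:
        return []
    if num_frames <= clip_len:
        return [list(range(num_frames))]
    starts = list(range(0, num_frames - clip_len + 1, stride))
    # Add a final clip so that the last frames are always covered.
    if starts[-1] + clip_len < num_frames:
        starts.append(num_frames - clip_len)
    return [list(range(s, s + clip_len)) for s in starts]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_instances(prev_emb, curr_emb, min_sim=0.5):
    """Greedy one-to-one matching of current-clip to previous-clip instances.

    Returns {current_index: previous_index}. Current instances missing from
    the result would start new tracks. `min_sim` is an assumed threshold.
    """
    pairs = sorted(
        ((cosine(p, c), i, j)
         for i, p in enumerate(prev_emb)
         for j, c in enumerate(curr_emb)),
        reverse=True,
    )
    used_prev, assignment = set(), {}
    for sim, i, j in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if i in used_prev or j in assignment:
            continue
        used_prev.add(i)
        assignment[j] = i
    return assignment


clips = overlapping_clips(num_frames=7, clip_len=3, stride=2)
prev_embeddings = [[1.0, 0.0], [0.0, 1.0]]
curr_embeddings = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
links = match_instances(prev_embeddings, curr_embeddings)
```

In this toy example, consecutive clips share one frame, the first two instances of the current clip are linked to existing tracks, and the third instance has no sufficiently similar predecessor and would begin a new track.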