DVIS: Decoupled Video Instance Segmentation Framework

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long real-world videos, primarily for two reasons. First, offline methods are limited by a tightly coupled modeling paradigm that treats all frames equally and disregards the interdependencies between adjacent frames, which introduces excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) effectively using temporal information during refinement, building on that accurate alignment. We introduce a novel referring tracker and temporal refiner to construct the \textbf{D}ecoupled \textbf{VIS} framework (\textbf{DVIS}). DVIS achieves new state-of-the-art (SOTA) performance in both VIS and video panoptic segmentation (VPS), surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, respectively, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69\% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11 GB of memory. The code is available at \href{https://github.com/zhang-tao-whu/DVIS}{https://github.com/zhang-tao-whu/DVIS}.
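
The three-stage decoupling described above (segment each frame, align instances frame by frame, then refine along the aligned tracks) can be illustrated with a toy sketch. It is not the DVIS implementation: the function names `segment`, `track`, `refine`, and `decoupled_vis` are hypothetical, instances are represented as plain embedding vectors, cosine-similarity matching stands in for the referring tracker, and a moving average stands in for the temporal refiner. It also assumes every frame contains the same number of instances.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def segment(frame):
    # Stand-in segmenter (assumption): each "frame" is already a list of
    # per-instance embeddings, so this stage only copies them.
    return [list(e) for e in frame]


def track(per_frame):
    """Frame-by-frame association: reorder each frame's instances so that
    slot i matches slot i of the previously aligned frame."""
    aligned = [per_frame[0]]
    for current in per_frame[1:]:
        prev = aligned[-1]
        # Brute-force search over permutations; only viable for a handful
        # of instances, which is enough for illustration.
        best = max(
            permutations(range(len(current))),
            key=lambda p: sum(cosine(prev[i], current[p[i]])
                              for i in range(len(prev))),
        )
        aligned.append([current[j] for j in best])
    return aligned


def refine(aligned, window=1):
    """Use temporal context: average each track over neighbouring frames."""
    num_frames, num_inst = len(aligned), len(aligned[0])
    refined = []
    for t in range(num_frames):
        lo, hi = max(0, t - window), min(num_frames, t + window + 1)
        frame = []
        for i in range(num_inst):
            dim = len(aligned[t][i])
            frame.append([
                sum(aligned[s][i][d] for s in range(lo, hi)) / (hi - lo)
                for d in range(dim)
            ])
        refined.append(frame)
    return refined


def decoupled_vis(frames):
    """Run the three stages independently, one after another."""
    per_frame = [segment(f) for f in frames]
    aligned = track(per_frame)
    return refine(aligned)


# Two instances whose order is swapped in the second frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
    [[1.0, 0.1], [0.1, 1.0]],
]
aligned = track([segment(f) for f in frames])
refined = decoupled_vis(frames)
```

In this sketch, refinement only sees tracks that have already been aligned, so a per-frame ordering error in the segmenter output does not leak into the temporal averaging. That ordering of stages is the point the abstract makes about precise alignment being a prerequisite for effective use of temporal information.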
