Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
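To make the decoupled formulation concrete, the following is a minimal sketch of the control flow only, not the DEVA implementation. All names (`segment_image`, `propagate`, `fuse`, `track`, `detect_every`) are hypothetical. The image-level model and the temporal propagation model are passed in as plain callables, and the fusion step is replaced by a simple greedy IoU association, which is an assumption made for illustration and not the bi-directional fusion described above.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # toy mask: the set of pixels covered by a segment


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(propagated: Dict[int, Mask], detections: List[Mask],
         next_id: int, threshold: float = 0.5) -> Tuple[Dict[int, Mask], int]:
    """Greedy IoU association (illustrative stand-in for the fusion step)."""
    fused = dict(propagated)
    free = set(propagated)  # track ids not yet matched to a detection
    for det in detections:
        best = max(free, key=lambda t: iou(propagated[t], det), default=None)
        if best is not None and iou(propagated[best], det) >= threshold:
            fused[best] = det      # matched: keep the track id, update the mask
            free.remove(best)
        else:
            fused[next_id] = det   # unmatched: start a new track
            next_id += 1
    return fused, next_id


def track(frames: list,
          segment_image: Callable[[object], List[Mask]],
          propagate: Callable[[Dict[int, Mask], object], Dict[int, Mask]],
          detect_every: int = 2) -> List[Dict[int, Mask]]:
    """Run the task-agnostic propagator on every frame and the
    task-specific image model only every `detect_every` frames."""
    tracks: Dict[int, Mask] = {}
    next_id = 0
    outputs = []
    for t, frame in enumerate(frames):
        tracks = propagate(tracks, frame)
        if t % detect_every == 0:
            tracks, next_id = fuse(tracks, segment_image(frame), next_id)
        outputs.append(dict(tracks))
    return outputs


# Toy usage: each "frame" already is its list of masks, the image model
# returns it unchanged, and the propagator is the identity (both assumptions).
frames = [
    [frozenset({(0, 0), (0, 1)})],
    [frozenset({(0, 1), (0, 2)})],
    [frozenset({(0, 0), (0, 1), (0, 2)}), frozenset({(5, 5)})],
]
outputs = track(frames,
                segment_image=lambda frame: frame,
                propagate=lambda tracks, frame: tracks)
```

The point of the sketch is the interface: swapping `segment_image` changes the target task, while `propagate` and the tracking loop stay fixed. This mirrors the claim that only an image-level model is needed per task.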