The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state of the art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the targets as abstract 'queries', which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on five of the seven benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS
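To make the idea of targets as abstract queries concrete, the following is a minimal NumPy sketch, not the TarViS implementation. It assumes that each target is a C-dimensional query vector and that a mask is obtained by thresholding the dot product between that query and per-pixel video features; the abstract does not specify this mechanism. The function names `predict_masks`, `queries_from_masks` and `queries_from_points` are hypothetical and chosen only for illustration.

```python
import numpy as np


def predict_masks(queries, features):
    """Shared mask head (assumed): dot product of queries with pixel features.

    queries:  (N, C) array, one row per target.
    features: (T, H, W, C) array of per-pixel video features.
    Returns boolean masks of shape (N, T, H, W).
    """
    logits = np.einsum("nc,thwc->nthw", queries, features)
    return logits > 0.0


def queries_from_masks(frame_features, masks):
    """VOS-style targets (assumed): average the features inside each given mask.

    frame_features: (H, W, C) features of the frame where targets are defined.
    masks:          (N, H, W) boolean masks, one per target.
    Returns an (N, C) array of queries.
    """
    weights = masks.astype(frame_features.dtype)
    sums = np.einsum("nhw,hwc->nc", weights, frame_features)
    counts = weights.sum(axis=(1, 2))[:, None]
    # Guard against empty masks to avoid division by zero.
    return sums / np.maximum(counts, 1.0)


def queries_from_points(frame_features, points):
    """PET-style targets (assumed): read the feature at each clicked point.

    points: (N, 2) integer array of (row, col) coordinates.
    Returns an (N, C) array of queries.
    """
    return frame_features[points[:, 0], points[:, 1]]
```

In this sketch only the construction of the queries depends on the task, while `predict_masks` is shared. That separation is what would let a single model switch between tasks at inference time without retraining. For VIS or VPS, the queries would instead be learned parameters representing object classes or instances, which is omitted here.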