TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

Training on large-scale datasets can boost the performance of video instance segmentation (VIS), but annotated VIS datasets are hard to scale up because of the high labor cost. What we do have are numerous isolated field-specific datasets, so it is appealing to train models jointly across an aggregation of datasets to increase data volume and diversity. However, because the category spaces are heterogeneous, simply combining multiple datasets dilutes the model's attention across different taxonomies, even as mask precision increases with data volume. It is therefore important to increase the data scale and enrich the taxonomy space while also improving classification precision. In this work, we show through analysis that providing extra taxonomy information helps models concentrate on a specific taxonomy, and we propose Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks: YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions and sets new state-of-the-art records on all benchmarks. These results demonstrate the effectiveness and generality of our approach. The code is available at https://github.com/rkzheng99/TMT-VIS
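
To make the two-stage idea concrete, the following is a minimal sketch and not the released TMT-VIS implementation. It assumes that stage one can be approximated by ranking category embeddings by dot-product similarity with a pooled video feature, and that stage two can be approximated by single-head scaled dot-product attention from instance queries to the selected priors with a residual add. The names `compile_taxonomy`, `aggregate_taxonomy`, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compile_taxonomy(video_feature, taxonomy_embeddings, top_k):
    """Stage 1 (assumed form): keep the top_k categories whose embeddings
    are most similar to a pooled video feature."""
    scores = {name: dot(video_feature, emb)
              for name, emb in taxonomy_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def aggregate_taxonomy(queries, priors):
    """Stage 2 (assumed form): each instance query attends over the selected
    taxonomy priors and adds the attended context as a residual."""
    scale = math.sqrt(len(priors[0]))
    enriched = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in priors])
        context = [sum(w * p[i] for w, p in zip(weights, priors))
                   for i in range(len(q))]
        enriched.append([qi + ci for qi, ci in zip(q, context)])
    return enriched


# Toy example with hypothetical 3-dimensional category embeddings.
taxonomy = {
    "person": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
video_feature = [0.9, 0.4, 0.1]

selected = compile_taxonomy(video_feature, taxonomy, top_k=2)
priors = [taxonomy[name] for name in selected]

# Two instance queries; these would feed the transformer decoder afterwards.
queries = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
enriched = aggregate_taxonomy(queries, priors)
```

In this toy setup the selection step discards the "car" embedding, so the queries are only conditioned on categories that the video feature supports. A real model would learn both the embeddings and the attention projections, which this sketch omits.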
