The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at https://github.com/SysCV/MaskFreeVis.
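The TK-Loss pipeline described in the abstract (patch matching, K-nearest neighbor selection, then a consistency loss on the matches) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the search radius, the patch size, the distance threshold `max_dist`, and the specific agreement term `-log(m_p * m_q + (1 - m_p) * (1 - m_q))` are all assumptions chosen for illustration, and the frames are toy grayscale grids rather than real video tensors.

```python
import math


def patch_distance(frame_a, frame_b, pa, pb, half):
    """Sum of squared differences between the patches centred at pa and pb.

    Coordinates are clamped to the image border (edge padding), an
    assumption made here so that border pixels can also be matched.
    """
    h, w = len(frame_a), len(frame_a[0])
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            ya = min(max(pa[0] + dy, 0), h - 1)
            xa = min(max(pa[1] + dx, 0), w - 1)
            yb = min(max(pb[0] + dy, 0), h - 1)
            xb = min(max(pb[1] + dx, 0), w - 1)
            total += (frame_a[ya][xa] - frame_b[yb][xb]) ** 2
    return total


def knn_matches(frame_a, frame_b, p, radius, half, k, max_dist):
    """One-to-many matching: up to k closest candidates for p in frame_b.

    Candidates are searched in a (2 * radius + 1)^2 window around p and
    kept only if their patch distance does not exceed max_dist.
    """
    h, w = len(frame_b), len(frame_b[0])
    candidates = []
    for qy in range(max(0, p[0] - radius), min(h, p[0] + radius + 1)):
        for qx in range(max(0, p[1] - radius), min(w, p[1] + radius + 1)):
            d = patch_distance(frame_a, frame_b, p, (qy, qx), half)
            if d <= max_dist:
                candidates.append((d, (qy, qx)))
    candidates.sort(key=lambda c: c[0])
    return [q for _, q in candidates[:k]]


def temporal_knn_consistency(frame_a, frame_b, mask_a, mask_b,
                             radius=1, half=1, k=2, max_dist=0.5, eps=1e-6):
    """Average consistency loss over all matched pixel pairs.

    mask_a and mask_b hold per-pixel foreground probabilities. The loss is
    low when matched pixels agree (both foreground or both background).
    No mask labels and no trainable parameters are involved.
    """
    h, w = len(frame_a), len(frame_a[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            matches = knn_matches(frame_a, frame_b, (y, x),
                                  radius, half, k, max_dist)
            for qy, qx in matches:
                m_p, m_q = mask_a[y][x], mask_b[qy][qx]
                agree = m_p * m_q + (1.0 - m_p) * (1.0 - m_q)
                total += -math.log(max(agree, eps))
                count += 1
    return total / count if count else 0.0
```

In this sketch, a mask prediction that moves with the object between two frames should receive a lower loss than one that contradicts the matched patches. A practical version would presumably vectorize the patch comparison on the GPU instead of looping over pixels.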