Mask-Free Video Instance Segmentation

The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at https://github.com/SysCV/MaskFreeVis.
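The matching-then-consistency idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`patch`, `knn_matches`, `consistency_loss`), the L2 patch distance, the `max_dist` cutoff, and the specific loss form are all assumptions chosen for illustration, since the abstract only states that patch matching is followed by a K-nearest-neighbor selection and a consistency loss. Frames and masks are plain nested lists of floats so the example needs only the standard library.

```python
import math


def patch(frame, y, x, r):
    """Flatten the (2r+1) x (2r+1) patch centred at (y, x).

    Returns None when the patch would extend outside the frame.
    """
    h, w = len(frame), len(frame[0])
    if y - r < 0 or x - r < 0 or y + r >= h or x + r >= w:
        return None
    return [frame[j][i]
            for j in range(y - r, y + r + 1)
            for i in range(x - r, x + r + 1)]


def knn_matches(frame_a, frame_b, y, x, patch_r=1, search_r=2, k=3, max_dist=0.5):
    """Find up to k locations in frame_b that match (y, x) in frame_a.

    Candidates are all locations within search_r of (y, x). They are ranked
    by L2 patch distance (an assumption), and those farther than max_dist
    (a hypothetical cutoff) are dropped. One source point can therefore
    receive several matches (one-to-many) or none.
    """
    ref = patch(frame_a, y, x, patch_r)
    if ref is None:
        return []
    candidates = []
    for yy in range(y - search_r, y + search_r + 1):
        for xx in range(x - search_r, x + search_r + 1):
            cand = patch(frame_b, yy, xx, patch_r)
            if cand is None:
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, cand)))
            if dist <= max_dist:
                candidates.append((dist, yy, xx))
    candidates.sort()  # smallest patch distance first
    return [(yy, xx) for _, yy, xx in candidates[:k]]


def consistency_loss(mask_a, mask_b, y, x, matches, eps=1e-6):
    """Penalise disagreement between the mask probability at (y, x) in one
    frame and the mask probabilities at its matches in the other frame.

    Assumed form: -log(p*q + (1-p)*(1-q)), averaged over the matches. It is
    small when both probabilities agree (both high or both low).
    """
    if not matches:
        return 0.0
    p = mask_a[y][x]
    total = 0.0
    for yy, xx in matches:
        q = mask_b[yy][xx]
        agreement = p * q + (1.0 - p) * (1.0 - q)
        total += -math.log(max(agreement, eps))
    return total / len(matches)
```

Note that nothing in this sketch is learned: the matching uses only pixel values, and the loss uses only the predicted mask probabilities, which is in line with the abstract's statement that the objective has no trainable parameters and needs no mask labels. In practice such a loss would be evaluated over many points and frame pairs with a tensor library rather than Python loops.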
