Instance segmentation is a fundamental research topic in computer vision, especially for autonomous driving. However, manually annotating instance masks is time-consuming and costly. To address this problem, prior works have adopted weak supervision based on 2D or 3D boxes. However, no prior work has segmented 2D and 3D instances simultaneously using only 2D box annotations, even though doing so could reduce the annotation cost by a further order of magnitude. We therefore propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS). It incorporates several fine-grained label generation and correction modules for both the 2D and 3D modalities to improve the quality of pseudo labels, along with a new multimodal cross-supervision approach, named Consistency Sparse Cross-modal Supervision (CSCS), which reduces the inconsistency between multimodal predictions through response distillation. Notably, transferring the 3D backbone to downstream tasks not only improves the performance of 3D detectors but also, with only 5% of the fully supervised annotations, outperforms fully supervised instance segmentation. On the Waymo dataset, the proposed framework shows significant improvements over the baseline, with increases of 2.59% mAP and 12.75% mAP on the 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin.
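To make the idea of cross-modal consistency concrete, the following is a minimal sketch of a response-level consistency term between a 2D mask prediction and per-point 3D predictions. It is not the CSCS loss from the paper, whose exact form the abstract does not give. The function names (`bernoulli_kl`, `sparse_consistency_loss`), the symmetric KL formulation, and the assumption that each LiDAR point has already been projected to an integer pixel location are all illustrative assumptions. The term is sparse in the sense that only pixels hit by a projected point contribute.

```python
import math


def bernoulli_kl(p, q, eps=1e-7):
    """KL divergence KL(p || q) between two Bernoulli foreground probabilities."""
    # Clamp away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def sparse_consistency_loss(mask_probs_2d, point_pixels, point_probs_3d):
    """Hypothetical consistency term between 2D and 3D responses.

    mask_probs_2d:  H x W nested list of foreground probabilities (2D branch).
    point_pixels:   list of (row, col) pixels that the 3D points project to
                    (the projection itself is assumed to be done elsewhere).
    point_probs_3d: per-point foreground probabilities (3D branch).
    """
    height = len(mask_probs_2d)
    width = len(mask_probs_2d[0]) if height else 0
    total, count = 0.0, 0
    for (row, col), p3d in zip(point_pixels, point_probs_3d):
        # Points that project outside the image carry no 2D response.
        if not (0 <= row < height and 0 <= col < width):
            continue
        p2d = mask_probs_2d[row][col]
        # Symmetric KL: each modality's response is compared against the other's.
        total += 0.5 * (bernoulli_kl(p3d, p2d) + bernoulli_kl(p2d, p3d))
        count += 1
    # Average over the supervised (sparse) locations only.
    return total / count if count else 0.0
```

In a training framework, each direction of the symmetric term would typically treat the other modality's response as a fixed target (no gradient), so that the two branches distill into each other; that detail is omitted here because the sketch uses plain Python floats.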