Multi-modal unsupervised domain adaptation (MM-UDA) for 3D semantic segmentation is a practical way to embed semantic understanding in autonomous systems without expensive point-wise annotations. While previous MM-UDA methods achieve overall improvement, they suffer from significantly class-imbalanced performance, which restricts their adoption in real applications. This imbalance is mainly caused by 1) self-training with imbalanced data and 2) the lack of pixel-wise 2D supervision signals. In this work, we propose Multi-modal Prior Aided (MoPA) domain adaptation to improve performance on rare objects. Specifically, we develop Valid Ground-based Insertion (VGI) to rectify the imbalanced supervision signals by inserting prior rare objects collected from the wild, while avoiding artificial artifacts that lead to trivial solutions. Meanwhile, our SAM consistency loss leverages the 2D prior semantic masks from SAM as pixel-wise supervision signals to encourage consistent predictions for each object in the semantic mask. The knowledge learned from modality-specific priors is then shared across modalities to achieve better rare object segmentation. Extensive experiments show that our method achieves state-of-the-art performance on the challenging MM-UDA benchmark. Code will be available at https://github.com/AronCao49/MoPA.
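To make the idea behind a mask-level consistency loss concrete, the following is a minimal sketch and not the formulation used in MoPA, which the abstract does not specify. It assumes each point has already been projected into the image and assigned the index of the SAM mask it falls in. The function name `mask_consistency_loss`, the `ignore_id` convention, and the squared-deviation penalty are all illustrative assumptions. The code is plain Python for readability; a practical version would use batched tensor operations.

```python
from collections import defaultdict


def mask_consistency_loss(probs, mask_ids, ignore_id=-1):
    """Hypothetical consistency penalty (an assumption, not the paper's loss).

    probs    : list of per-point class-probability lists
    mask_ids : list with one mask index per point; ignore_id means "no mask"

    Returns the mean squared deviation of each point's class probabilities
    from the average probabilities of the mask that contains it.
    """
    # Group the predictions by the mask they belong to.
    groups = defaultdict(list)
    for p, m in zip(probs, mask_ids):
        if m != ignore_id:
            groups[m].append(p)

    total, count = 0.0, 0
    for members in groups.values():
        n_cls = len(members[0])
        # Average class distribution inside this mask.
        mean = [sum(p[c] for p in members) / len(members) for c in range(n_cls)]
        # Penalize every point that disagrees with its mask average.
        for p in members:
            total += sum((p[c] - mean[c]) ** 2 for c in range(n_cls))
            count += 1

    # No point lies in any mask: nothing to penalize.
    return total / count if count else 0.0


if __name__ == "__main__":
    # Two points in mask 0 that disagree completely, one unmasked point.
    print(mask_consistency_loss([[1.0, 0.0], [0.0, 1.0], [0.3, 0.7]], [0, 0, -1]))
```

Under these assumptions the loss is zero when all points inside a mask receive identical predictions and grows as predictions within a mask diverge, which is the behavior the abstract describes as encouraging consistent predictions for each object.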