Existing cross-modal domain adaptation methods for 3D semantic segmentation make predictions relying only on the 2D-3D complementarity obtained through cross-modal feature matching. However, because supervision is lacking in the target domain, this complementarity is not always reliable, and results degrade when the domain gap is large. To address the lack of supervision, we introduce masked modeling into this task and propose Mx2M, a method that uses masked cross-modality modeling to reduce large domain gaps. Mx2M contains two components. The first, its core, is cross-modal removal and prediction (xMRP), which lets Mx2M adapt to various scenarios and provides cross-modal self-supervision. The second is a new form of cross-modal feature matching, the dynamic cross-modal filter (DxMF), which lets the whole method dynamically use more suitable 2D-3D complementarity. Evaluated on three domain adaptation (DA) scenarios, Day/Night, USA/Singapore, and A2D2/SemanticKITTI, Mx2M brings large improvements over previous methods on many metrics.
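The idea of removing part of one modality and predicting it from the other can be illustrated with a small sketch. This is not the authors' xMRP implementation: the function name `mask_and_predict_loss`, the linear map `weight`, the `mask_ratio` parameter, and the mean-squared-error objective are all assumptions chosen for illustration. It only shows how hiding a random subset of 3D point features and regressing them from paired 2D features yields a training signal that needs no labels.

```python
import numpy as np


def mask_and_predict_loss(feat_2d, feat_3d, weight, mask_ratio=0.25, rng=None):
    """Hypothetical masked cross-modal objective (illustrative only).

    feat_2d: (n, d2) image features sampled at the n projected points.
    feat_3d: (n, d3) point features paired row by row with feat_2d.
    weight:  (d2, d3) linear map standing in for a prediction head.
    Returns the MSE over the hidden points and the boolean mask used.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = feat_3d.shape[0]

    # Choose which 3D features to "remove" (at least one point).
    n_masked = max(1, int(round(mask_ratio * n)))
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True

    # Predict 3D features from the 2D modality alone.
    pred_3d = feat_2d @ weight

    # Score the prediction only where the 3D features were hidden.
    diff = pred_3d[mask] - feat_3d[mask]
    loss = float(np.mean(diff ** 2))
    return loss, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f2d = rng.normal(size=(8, 4))
    w = rng.normal(size=(4, 3))
    f3d = f2d @ w + 0.1 * rng.normal(size=(8, 3))
    loss, mask = mask_and_predict_loss(f2d, f3d, w, mask_ratio=0.25, rng=rng)
    print("masked points:", int(mask.sum()), "loss:", loss)
```

Because the loss depends only on the two input modalities, it can be computed on unlabeled target-domain data, which is the kind of self-supervision the abstract refers to. A real system would replace the linear map with a learned network and would likely mask in both directions.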