Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often fail to generalize to target domains outside the source (training) data distribution. To reduce such domain gaps and make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which has two components. (i) It leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. (ii) It introduces a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from the source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In extensive experiments on large-scale benchmarks, such as nuScenes, Waymo, and KITTI, the two components above provide significant performance gains for UDA tasks, achieving state-of-the-art performance.
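The adversarial objective in (ii) can be illustrated with a small sketch. The abstract does not specify how the adversarial training is implemented, so the code below substitutes a common stand-in, gradient reversal against a linear domain discriminator, and should not be read as the CMDA implementation. All names (`domain_bce`, `adversarial_grads`, `descend`, `lam`) are hypothetical, and the "features" are plain lists of floats rather than BEV feature maps.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def domain_logit(feature, w, b):
    """Hypothetical linear domain discriminator: one logit per feature vector."""
    return sum(wj * fj for wj, fj in zip(w, feature)) + b


def domain_bce(features, labels, w, b):
    """Mean binary cross-entropy; labels are 0 (source) or 1 (target)."""
    total = 0.0
    for f, y in zip(features, labels):
        p = sigmoid(domain_logit(f, w, b))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(features)


def adversarial_grads(features, labels, w, b, lam=1.0):
    """Return (grad_w, grad_b, grad_features) for one adversarial step.

    grad_w and grad_b are ordinary gradients of the discriminator loss.
    grad_features is the gradient w.r.t. the features with its sign flipped
    and scaled by lam (gradient reversal, an assumed technique): descending
    along it pushes the features to *increase* the discriminator loss.
    """
    n = len(features)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    grad_features = []
    for f, y in zip(features, labels):
        # d(loss)/d(logit) for sigmoid + binary cross-entropy, averaged over n.
        err = (sigmoid(domain_logit(f, w, b)) - y) / n
        for j, fj in enumerate(f):
            grad_w[j] += err * fj
        grad_b += err
        grad_features.append([-lam * err * wj for wj in w])
    return grad_w, grad_b, grad_features


def descend(vectors, grads, lr):
    """Plain gradient-descent update applied to a list of vectors."""
    return [[v - lr * g for v, g in zip(vec, grad)]
            for vec, grad in zip(vectors, grads)]
```

In this toy setting, a descent step on the discriminator weights lowers the domain-classification loss, while the same kind of step on the features, using the reversed gradient, raises it. That opposition is the sense in which the features are trained to be hard to assign to the source or the target domain.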