Perception is crucial for autonomous driving systems, where bird's-eye-view (BEV) architectures have recently reached state-of-the-art performance. Self-supervised representation learning is attractive because annotating 2D and 3D data is expensive and laborious. Although previous research has investigated pretraining methods for both LiDAR-based and camera-based 3D object detection, a unified pretraining framework for multimodal BEV perception is still missing. In this study, we introduce CALICO, a novel framework that applies contrastive objectives to both LiDAR and camera backbones. Specifically, CALICO incorporates two stages: point-region contrast (PRC) and region-aware distillation (RAD). PRC better balances region-level and scene-level representation learning on the LiDAR modality and offers significant performance improvements over existing methods. RAD performs contrastive distillation using our self-trained teacher model. CALICO's efficacy is substantiated by extensive evaluations on 3D object detection and BEV map segmentation tasks. Notably, CALICO outperforms the baseline method by 10.5% on NDS and 8.6% on mAP. Moreover, CALICO improves the robustness of multimodal 3D object detection against adversarial attacks and corruptions. Additionally, our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception.
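To make the phrase "contrastive objectives" concrete, the following is a minimal sketch of a generic InfoNCE-style loss over paired region embeddings. It is not the PRC or RAD formulation from this work, which the abstract does not specify. The function names (`l2_normalize`, `info_nce`), the temperature value, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss.

    anchors[i] is paired with positives[i]; every other positives[j]
    acts as a negative for anchor i. The temperature is a hypothetical default.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the target class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings standing in for the same two regions seen in two views
# (made-up numbers, purely illustrative).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(view_a, view_b)         # correct region pairing
shuffled = info_nce(view_a, view_b[::-1])  # deliberately mismatched pairing
```

In this toy setup the correctly paired regions yield a lower loss than the mismatched pairing, which is the behavior a contrastive objective rewards. How regions are sampled and how region-level and scene-level terms are balanced is specific to the method and is not captured here.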