In 3D object detection for autonomous driving, sensor configurations are diverse and complex, ranging from single-modality to multi-modality setups. Multi-modal methods incur higher system complexity, while single-modal methods are less accurate, so striking a tradeoff between the two is difficult. In this work, we propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors. Specifically, during training, UniDistill projects the features of both the teacher and the student detector into Bird's-Eye-View (BEV), a representation that suits different modalities. Three distillation losses are then computed to sparsely align the foreground features, helping the student learn from the teacher without adding any cost at inference. Because different detectors share a similar detection paradigm in BEV, UniDistill readily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths. Furthermore, the three distillation losses filter out the effect of misaligned background information and balance objects of different sizes, improving distillation effectiveness. Extensive experiments on nuScenes demonstrate that UniDistill improves the mAP and NDS of student detectors by 2.0% to 3.2%.
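The idea of sparsely aligning foreground BEV features can be illustrated with a small sketch. The abstract does not define the three losses, so the code below is not the authors' formulation: it is one illustrative feature-level loss under stated assumptions. The names `box_keypoints` and `sparse_feature_loss`, the axis-aligned box format, and the choice of nine sample cells per box are all hypothetical. Sampling a fixed number of cells per box is one simple way to ignore background cells and to give large and small objects equal weight.

```python
from typing import List, Sequence, Tuple

# Assumption: a BEV feature map is a nested list indexed as [channel][row][col].
Grid = List[List[List[float]]]
# Assumption: a ground-truth box is axis-aligned in BEV grid cells,
# given as (row_min, col_min, row_max, col_max), inclusive.
Box = Tuple[int, int, int, int]


def box_keypoints(box: Box) -> List[Tuple[int, int]]:
    """Return 9 sample cells for a box: 4 corners, 4 edge midpoints, 1 center."""
    r0, c0, r1, c1 = box
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    return [(r, c) for r in (r0, rm, r1) for c in (c0, cm, c1)]


def sparse_feature_loss(teacher: Grid, student: Grid, boxes: Sequence[Box]) -> float:
    """Mean squared teacher-student difference over foreground sample cells only.

    Background cells never enter the sum, and every box contributes the same
    number of samples regardless of its size.
    """
    if not boxes:
        return 0.0
    channels = len(teacher)
    total = 0.0
    count = 0
    for box in boxes:
        for r, c in box_keypoints(box):
            for ch in range(channels):
                diff = teacher[ch][r][c] - student[ch][r][c]
                total += diff * diff
            count += 1
    return total / (count * channels)


if __name__ == "__main__":
    # One-channel 6x6 BEV maps; the student differs from the teacher in two cells.
    teacher = [[[0.0] * 6 for _ in range(6)]]
    student = [[[0.0] * 6 for _ in range(6)]]
    student[0][5][5] = 7.0  # background cell: ignored by the loss
    student[0][1][1] = 3.0  # corner of the box below: penalized
    print(sparse_feature_loss(teacher, student, [(1, 1, 3, 3)]))
```

In this toy setup the large mismatch at the background cell has no effect on the loss, while the mismatch at the box corner does. Because the loss is only evaluated during training, the student detector's inference path is unchanged.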