Monocular 3D object detection is an inherently ill-posed problem, as it is challenging to predict accurate 3D localization from a single image. Existing knowledge distillation methods for monocular 3D detection usually project LiDAR point clouds onto the image plane and train the teacher network accordingly. Transferring knowledge from LiDAR-based models to RGB-based models is more complex, so a general distillation strategy is needed. To alleviate this cross-modal problem, we propose MonoSKD, a novel Knowledge Distillation framework for Monocular 3D detection based on the Spearman correlation coefficient, which learns the relative correlation between cross-modal features. Considering the large gap between these features, strict feature alignment may mislead training, so we propose a looser Spearman loss. Furthermore, by selecting appropriate distillation locations and removing redundant modules, our scheme uses fewer GPU resources and trains faster than existing methods. Extensive experiments on the challenging KITTI 3D object detection benchmark verify the effectiveness of our framework. Our method achieves state-of-the-art performance at the time of submission with no additional inference computational cost. Our code is available at https://github.com/Senwang98/MonoSKD
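To make the idea of a "looser" Spearman loss concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss takes the form 1 - rho, where rho is the Spearman correlation (the Pearson correlation of ranks) between flattened student and teacher features. The names `ranks` and `spearman_loss` are hypothetical, and the hard ranking used here is not differentiable, so an actual training loss would need a differentiable ranking relaxation that this sketch does not cover.

```python
import numpy as np


def ranks(x: np.ndarray) -> np.ndarray:
    """Ordinal ranks 0..n-1 of a 1-D array.

    Assumption: ties are rare for floating-point features, so they are
    simply broken by position instead of being averaged.
    """
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=np.float64)
    r[order] = np.arange(len(x))
    return r


def spearman_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Hypothetical loss 1 - rho, with rho the Spearman correlation.

    rho is the Pearson correlation of the rank vectors, so the loss only
    depends on the relative ordering of feature values, not their scale.
    """
    s = ranks(student.ravel())
    t = ranks(teacher.ravel())
    s = s - s.mean()
    t = t - t.mean()
    rho = float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))
    return 1.0 - rho


# Toy features standing in for a LiDAR-based teacher and an RGB-based student.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8))
student = np.exp(teacher)  # same ordering as the teacher, very different values

mse = float(np.mean((student - teacher) ** 2))   # strict alignment penalty
loose = spearman_loss(student, teacher)          # rank-based penalty
reversed_order = spearman_loss(-teacher, teacher)

print(f"MSE: {mse:.3f}, Spearman loss: {loose:.3e}, reversed: {reversed_order:.3f}")
```

In this toy case the student preserves the teacher's ordering exactly, so the rank-based loss is essentially zero while a strict element-wise penalty such as MSE stays large. Reversing the ordering drives the rank-based loss to its maximum of 2.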