Multi-view camera-based 3D object detection has gained popularity due to its low cost. However, accurately inferring 3D geometry from camera data alone remains challenging, which limits model performance. One promising approach is to distill precise 3D geometry knowledge from LiDAR data, but transferring knowledge between sensor modalities is hindered by the significant modality gap. In this paper, we address this challenge from the perspectives of both architecture design and knowledge distillation and present a new simulated multi-modal 3D object detection method named BEVSimDet. We first introduce a novel framework comprising a LiDAR-camera fusion-based teacher and a simulated multi-modal student, where the student simulates multi-modal features from image-only input. To facilitate effective distillation, we propose a simulated multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal distillation simultaneously. By combining these, BEVSimDet learns better feature representations for 3D object detection while retaining cost-effective camera-only deployment. Experimental results on the challenging nuScenes benchmark demonstrate the effectiveness of BEVSimDet and its superiority over recent representative methods. The source code will be released.
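As a rough illustration of how three distillation terms could be combined into one training objective, the following is a minimal sketch and not the BEVSimDet implementation. The abstract does not specify the loss functions, the feature pairings, or the weights, so everything here is an assumption: the `Features` container, the use of mean squared error, the mapping of "intra-modal" to camera features, "cross-modal" to LiDAR versus simulated-LiDAR features, and "multi-modal" to fused features, and the weight names `w_intra`, `w_cross`, and `w_multi`.

```python
from dataclasses import dataclass
from typing import Sequence


def mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two equal-length flat feature vectors."""
    if len(student) != len(teacher):
        raise ValueError("feature vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


@dataclass
class Features:
    """Hypothetical container for flattened features of one model."""
    camera: Sequence[float]  # camera-branch features
    lidar: Sequence[float]   # real LiDAR features (teacher) or simulated ones (student)
    fused: Sequence[float]   # fusion-branch features (real or simulated)


def combined_distill_loss(
    student: Features,
    teacher: Features,
    w_intra: float = 1.0,
    w_cross: float = 1.0,
    w_multi: float = 1.0,
) -> float:
    """Weighted sum of three feature-matching terms (assumed form, see lead-in)."""
    # Assumed intra-modal term: student camera features vs. teacher camera features.
    intra = mse(student.camera, teacher.camera)
    # Assumed cross-modal term: student simulated-LiDAR vs. teacher LiDAR features.
    cross = mse(student.lidar, teacher.lidar)
    # Assumed multi-modal term: student simulated-fusion vs. teacher fused features.
    multi = mse(student.fused, teacher.fused)
    return w_intra * intra + w_cross * cross + w_multi * multi


# Toy usage with made-up numbers; real features would be BEV feature maps.
teacher = Features(camera=[1.0, 2.0], lidar=[0.0, 0.0], fused=[1.0, 1.0])
student = Features(camera=[1.0, 2.0], lidar=[1.0, 1.0], fused=[1.0, 3.0])
loss = combined_distill_loss(student, teacher)
```

In a real training setup this distillation term would presumably be added to the student's detection loss, with the teacher's weights frozen; those details are likewise not stated in the abstract.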