Multi-view camera-based 3D object detection has gained popularity due to its low cost. However, accurately inferring 3D geometry from camera data alone remains challenging, which limits model performance. One promising way to address this issue is to distill precise 3D geometry knowledge from LiDAR data, yet transferring knowledge across sensor modalities is hindered by the significant modality gap. In this paper, we approach this challenge from the perspectives of both architecture design and knowledge distillation, and present a new simulated multi-modal 3D object detection method named BEVSimDet. We first introduce a novel framework that consists of a LiDAR-camera fusion-based teacher and a simulated multi-modal student, where the student simulates multi-modal features from image-only input. To facilitate effective distillation, we propose a simulated multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal distillation simultaneously. By combining the three, BEVSimDet learns better feature representations for 3D object detection while retaining cost-effective camera-only deployment. Experimental results on the challenging nuScenes benchmark demonstrate the effectiveness of BEVSimDet and its superiority over recent representative methods. The source code will be released at \href{https://github.com/ViTAE-Transformer/BEVSimDet}{BEVSimDet}.
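To make the three distillation terms concrete, the following is a minimal sketch and not the released BEVSimDet implementation. It assumes that each term is a plain mean-squared-error feature-mimicking loss, that intra-modal distillation pairs the teacher and student camera features, that cross-modal distillation pairs the teacher LiDAR features with the student's simulated LiDAR features, and that multi-modal distillation pairs the fused features. The names `mse`, `simulated_distillation_loss`, the dictionary keys, and the weights are all hypothetical.

```python
from typing import Dict, List, Optional

# A flattened feature map; a stand-in for a real BEV feature tensor.
Feature = List[float]

# Hypothetical branch names: the student's "lidar" and "fused" features are
# simulated from images, the teacher's come from real LiDAR + camera input.
BRANCHES = ("camera", "lidar", "fused")


def mse(student: Feature, teacher: Feature) -> float:
    """Mean squared error between two equally sized feature vectors."""
    if len(student) != len(teacher):
        raise ValueError("feature sizes must match")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def simulated_distillation_loss(
    student: Dict[str, Feature],
    teacher: Dict[str, Feature],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of the three assumed feature-mimicking terms.

    camera -> intra-modal, lidar -> cross-modal, fused -> multi-modal.
    Branches missing from `weights` default to a weight of 1.0 (assumption).
    """
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * mse(student[name], teacher[name])
        for name in BRANCHES
    )


# Toy features only; real features would come from the two networks.
teacher_feats = {"camera": [1.0, 2.0], "lidar": [1.0, 1.0], "fused": [1.0, 3.0]}
student_feats = {"camera": [1.0, 2.0], "lidar": [0.0, 0.0], "fused": [1.0, 1.0]}
loss = simulated_distillation_loss(student_feats, teacher_feats)
```

In training, a term of this kind would be added to the usual detection loss and only the student would receive gradients; at deployment the teacher and all LiDAR input are dropped, which is what keeps inference camera-only.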