SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection

Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry from camera data alone remains challenging and can limit detection performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information can be greatly hindered by the significant gap between the two sensing modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method that carefully crafts both the model architecture and the distillation strategy. Specifically, we devise multi-modal architectures for both the teacher and student models: a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the ``identical'' architecture design, the student can mimic the teacher and generate multi-modal features with only multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the bird's-eye-view (BEV) space. Combining these components, SimDistill learns better feature representations for 3D object detection while maintaining cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8\% mAP and 4.1\% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill.

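
To make the three distillation terms named in the abstract concrete, the following is a minimal sketch rather than the released implementation. It assumes that each term is a plain mean-squared error between BEV feature maps and that the terms are combined by a weighted sum; the abstract specifies neither choice. The pairing of branches (student camera with teacher camera, student simulated-LiDAR with teacher LiDAR, student simulated fusion with teacher fusion) is our reading of the abstract, and the function name, the dictionary keys, and the weights are hypothetical.

```python
import numpy as np


def mse(student_feat, teacher_feat):
    """Mean squared error between two BEV feature maps of equal shape."""
    return float(np.mean((student_feat - teacher_feat) ** 2))


def simulated_distillation_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three feature-imitation terms (illustrative only).

    `student` and `teacher` are dicts of BEV feature maps with shape (C, H, W),
    keyed by the hypothetical branch names "camera", "lidar", and "fused".
    The student's "lidar" and "fused" entries stand for its simulated branches,
    which are computed from images only.
    """
    w_intra, w_cross, w_fusion = weights
    terms = {
        # Assumed pairing: student camera branch vs. teacher camera branch.
        "intra_modal": mse(student["camera"], teacher["camera"]),
        # Assumed pairing: student simulated-LiDAR branch vs. teacher LiDAR branch.
        "cross_modal": mse(student["lidar"], teacher["lidar"]),
        # Assumed pairing: student simulated fusion vs. teacher LiDAR-camera fusion.
        "fusion": mse(student["fused"], teacher["fused"]),
    }
    total = (
        w_intra * terms["intra_modal"]
        + w_cross * terms["cross_modal"]
        + w_fusion * terms["fusion"]
    )
    return total, terms


# Toy data: random teacher features and a slightly perturbed student.
rng = np.random.default_rng(0)
shape = (8, 16, 16)  # (channels, BEV height, BEV width), arbitrary toy sizes
teacher = {k: rng.standard_normal(shape) for k in ("camera", "lidar", "fused")}
student = {k: v + 0.1 * rng.standard_normal(shape) for k, v in teacher.items()}

total, terms = simulated_distillation_loss(student, teacher)
print(total, terms)
```

At deployment only the student is kept, so none of these terms requires LiDAR input at inference time; the teacher features are needed during training alone.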