Current research is primarily dedicated to improving the accuracy of camera-only 3D object detectors (apprentices) through knowledge transferred from LiDAR-based or multi-modal counterparts (experts). However, the domain gap between LiDAR and camera features, coupled with an inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model should rely predominantly on camera features while still achieving performance comparable to multi-modal models. To this end, we introduce VCD, a framework for improving the camera-only apprentice model that comprises an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert, VCD-E, adopts the same structure as the camera-only apprentice to alleviate feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced to rectify the motion misalignment of each object in the scene individually. With these improvements, our camera-only apprentice, VCD-A, sets a new state of the art on nuScenes with a score of 63.1% NDS.