Integrating LiDAR and camera information in the bird's eye view (BEV) representation has proven effective for 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, the indiscriminate fusion used in previous methods often degrades performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information during fusion. By treating image BEV features as implicit guidance rather than naively concatenating them, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance helps the LiDAR-centric paradigm address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block that enhances LiDAR feature diffusion with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation outperforms state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy is more robust to depth noise than naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
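The core idea behind the Sparse Voxel Dilation Block — densifying sparse LiDAR foreground occupancy only where an image-derived prior agrees — can be illustrated with a minimal 2D BEV sketch. This is a hypothetical NumPy illustration of the general mechanism, not the paper's actual implementation; the grid sizes, the `prior` tensor, and the function name are all assumptions for exposition.

```python
import numpy as np

def dilate_foreground(occ: np.ndarray, prior: np.ndarray, thr: float = 0.5) -> np.ndarray:
    """Sketch of image-prior-guided voxel dilation on a 2D BEV grid.

    occ:   (H, W) bool LiDAR occupancy (True = voxel contains points).
    prior: (H, W) float image-derived foreground probability (hypothetical).
    Returns occupancy where empty cells adjacent to occupied ones are
    activated only if the image prior marks them as foreground.
    """
    H, W = occ.shape
    padded = np.pad(occ, 1)  # zero-pad so border shifts stay in bounds
    neighbor = np.zeros_like(occ)
    # OR together the 3x3 neighborhood (8 neighbors + the cell itself)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            neighbor |= padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    # new cells: empty, adjacent to occupied, and supported by the image prior
    return occ | (neighbor & (prior > thr) & ~occ)

# Toy example: a single occupied voxel, with the image prior supporting
# exactly one neighboring cell.
occ = np.zeros((5, 5), dtype=bool)
occ[2, 2] = True
prior = np.zeros((5, 5))
prior[2, 3] = 0.9  # image says this empty neighbor is foreground
dilated = dilate_foreground(occ, prior)
# Only the prior-supported neighbor is filled in: 2 occupied cells total.
```

The design point this sketch captures is selectivity: unconditional morphological dilation would blur object boundaries into the background, whereas gating the dilation with image semantics densifies only plausible foreground, which is why the LiDAR-centric paradigm can use image features without inheriting their depth errors.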