LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion

LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. This practice lacks fine-grained region-level information and yields suboptimal fusion performance. In this paper, we present the Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet builds on previous literature, but we use point centroids, rather than voxel centers, to represent the position of voxel features more precisely, thus achieving better cross-modal alignment. For the Local Fusion (LoF), we first divide each proposal into uniform grids and then project the grid centers onto the images. The image features around the projected grid points are sampled and fused with position-decorated point cloud features, making full use of the rich contextual information around the proposals. We further propose the Feature Dynamic Aggregation (FDA) module to enable information interaction between the locally and globally fused features, producing more informative multi-modal features. Extensive experiments on both the Waymo Open Dataset (WOD) and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on the Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. For the first time, the detection performance on all three classes surpasses 80 APH (L2) simultaneously. Code will be available at \url{https://github.com/sankin97/LoGoNet}.
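
The geometric steps named in the abstract (point centroids per voxel for GoF, and uniform proposal grids projected onto the image for LoF) can be illustrated with a minimal NumPy sketch. This is not the LoGoNet implementation. The function names, the 3x4 `projection` matrix (assumed to map 3D points directly to pixel coordinates), and the box parameterization (center, size, yaw about the z axis) are assumptions made for illustration. Image feature sampling, the fusion itself, and the FDA module are omitted.

```python
import numpy as np


def voxel_centroids(points, voxel_size):
    """Mean position of the points falling in each occupied voxel.

    points: (N, 3) array of x, y, z coordinates.
    Returns (voxel_indices, centroids), one row per occupied voxel.
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)  # unbuffered scatter-add per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]


def proposal_grid_centers(center, size, yaw, grid=3):
    """Centers of a uniform grid x grid x grid partition of a 3D box.

    center: (3,) box center; size: (3,) box extents along its own axes;
    yaw: rotation about the z axis in radians (assumed parameterization).
    """
    # Cell centers in normalized box coordinates, strictly inside (-0.5, 0.5).
    ticks = (np.arange(grid) + 0.5) / grid - 0.5
    gx, gy, gz = np.meshgrid(ticks, ticks, ticks, indexing="ij")
    local = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) * np.asarray(size)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.asarray(center)


def project_to_image(points, projection):
    """Project (N, 3) points with a 3x4 projection matrix.

    Returns (N, 2) pixel coordinates (NaN where invalid) and a boolean
    mask marking points with positive depth.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])
    uvw = homo @ projection.T
    depth = uvw[:, 2]
    in_front = depth > 1e-6
    uv = np.full((len(points), 2), np.nan)
    uv[in_front] = uvw[in_front, :2] / depth[in_front, None]
    return uv, in_front
```

In a full pipeline, the projected grid points would index into the image feature map, for example by bilinear sampling, and the centroids would replace voxel centers as the 3D reference positions for cross-modal alignment. Both steps depend on the calibration and feature layout of the specific dataset, so they are left out here.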
