Fusing data from cameras and LiDAR sensors is essential for robust 3D object detection. A key challenge in camera-LiDAR fusion is mitigating the large domain gap between the two sensors, in both coordinates and data distribution, when fusing their features. In this paper, we propose a novel camera-LiDAR fusion architecture called 3D Dual-Fusion, which is designed to mitigate the gap between the feature representations of camera and LiDAR data. The proposed method fuses the features of the camera-view and 3D voxel-view domains and models their interactions through deformable attention. We redesign the transformer fusion encoder to aggregate the information from the two domains. The two major changes are 1) dual-query-based deformable attention to fuse the dual-domain features interactively and 2) 3D local self-attention to encode the voxel-domain queries prior to dual-query decoding. Experimental results show that the proposed camera-LiDAR fusion architecture achieves competitive performance on the KITTI and nuScenes datasets, with state-of-the-art performance in some 3D object detection benchmark categories.
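As a rough illustration of the deformable attention on which the fusion encoder builds, the following is a minimal single-head, single-query sketch in NumPy. It is not the 3D Dual-Fusion implementation: it shows only the generic mechanism in which a query predicts sampling offsets around a reference point on a 2D feature map and aggregates the sampled features with softmax weights. The function names, the projection matrices `w_offset`, `w_attn`, and `w_value`, and all shapes are assumptions made for this example.

```python
import numpy as np


def bilinear_sample(feat, x, y):
    """Bilinearly sample feat of shape (H, W, C) at a continuous location (x, y).

    Locations outside the map are clamped to the border.
    """
    h, w, _ = feat.shape
    x = float(np.clip(x, 0.0, w - 1.0))
    y = float(np.clip(y, 0.0, h - 1.0))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom


def deformable_attention(query, feat, ref_xy, w_offset, w_attn, w_value):
    """Single-head deformable attention for one query (illustrative only).

    query:    (D,)      query embedding
    feat:     (H, W, C) feature map to sample from
    ref_xy:   (2,)      reference point in pixel coordinates (x, y)
    w_offset: (D, 2K)   hypothetical projection predicting K (dx, dy) offsets
    w_attn:   (D, K)    hypothetical projection predicting K attention logits
    w_value:  (C, C_out) hypothetical value projection
    """
    k = w_attn.shape[1]
    offsets = (query @ w_offset).reshape(k, 2)

    # Softmax over the K sampling points (shifted for numerical stability).
    logits = query @ w_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    out = np.zeros(w_value.shape[1])
    for (ox, oy), a in zip(offsets, weights):
        sampled = bilinear_sample(feat, ref_xy[0] + ox, ref_xy[1] + oy)
        out += a * (sampled @ w_value)
    return out
```

In a dual-query design such as the one described above, one would expect a mechanism of this kind to be applied with queries from both the camera-view and voxel-view domains; how the two query sets are coupled is specific to the paper and is not reproduced in this sketch.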