LiDAR and cameras are complementary sensors for 3D object detection in autonomous driving. However, exploiting the interaction between point clouds and images is not straightforward, and the critical factor is how to align features from these heterogeneous modalities. Many current methods achieve feature alignment through projection calibration alone, without accounting for errors in the coordinate conversion between sensors, which leads to sub-optimal performance. In this paper, we present GraphAlign, a more accurate feature alignment strategy for 3D object detection based on graph matching. Specifically, we fuse image features from a semantic segmentation encoder in the image branch with point cloud features from a 3D sparse CNN in the LiDAR branch. To save computation, we divide the point cloud features into subspaces and construct nearest-neighbor relationships by computing Euclidean distances within each subspace. Using the projection calibration between the image and the point cloud, we project the nearest neighbors of the point cloud features onto the image features. Then, by matching the nearest neighbors of a single point cloud against multiple images, we search for a more appropriate feature alignment. In addition, we provide a self-attention module that increases the weights of significant relations to fine-tune the feature alignment between the heterogeneous modalities. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness and efficiency of GraphAlign.
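The two geometric steps named in the abstract, nearest-neighbor search restricted to subspaces and projection of points onto the image, can be illustrated with a minimal NumPy sketch. This is not the GraphAlign implementation. The function names, the choice of `k`, the use of raw 3D coordinates as the distance space, and the way subspaces are formed (equal-sized chunks of points sorted along the x axis) are all assumptions made for illustration, and the `3x4` matrix `proj` stands in for whatever calibration the sensors provide. The graph matching and self-attention stages are not shown.

```python
import numpy as np


def knn_within_subspaces(points, k=4, num_subspaces=4):
    """Return (N, k) neighbor indices, searching only inside each point's subspace.

    Assumption (not from the paper): a subspace is an equal-sized chunk of
    the points after sorting them along the x axis.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    neighbors = np.empty((n, k), dtype=np.int64)
    order = np.argsort(points[:, 0], kind="stable")
    for chunk in np.array_split(order, num_subspaces):
        if chunk.size <= k:
            raise ValueError("each subspace needs more than k points")
        sub = points[chunk]
        # Pairwise squared Euclidean distances inside this subspace only.
        d2 = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
        local = np.argsort(d2, axis=1)[:, :k]
        neighbors[chunk] = chunk[local]  # map local indices back to global ones
    return neighbors


def project_to_image(points, proj):
    """Project (N, 3) LiDAR points with a hypothetical 3x4 matrix.

    Returns (N, 2) pixel coordinates and the (N,) depth of each point.
    """
    points = np.asarray(points, dtype=float)
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = homo @ np.asarray(proj, dtype=float).T
    depth = cam[:, 2]
    uv = cam[:, :2] / depth[:, None]
    return uv, depth
```

Restricting the search to subspaces reduces the pairwise distance cost from one `N x N` matrix to several much smaller ones, at the price of missing neighbors that lie across a subspace boundary. Projecting the resulting neighbor indices with `project_to_image` gives, for each point, a small set of candidate pixel locations rather than the single location that projection calibration alone would provide.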