In perception for autonomous driving, multi-modal methods have become a trend because LiDAR point clouds and image data are complementary. However, the performance of previous methods is usually limited by the sparsity of the point cloud or by the noise caused by misalignment between the LiDAR and the camera. To address these two problems, we introduce a new concept, the Voxel Region (VR), which is obtained by dynamically projecting the sparse local point cloud in each voxel. Building on it, we propose a novel fusion method named Sparse-to-Dense Voxel Region Fusion (SDVRF). Specifically, more pixels of the image feature map inside the VR are gathered to supplement the voxel feature extracted from sparse points, achieving denser fusion. Meanwhile, unlike prior methods, which project fixed-size grids, our strategy of generating dynamic regions achieves better alignment and avoids introducing too much background noise. Furthermore, we propose a multi-scale fusion framework to extract more contextual information and capture the features of objects of different sizes. Experiments on the KITTI dataset show that our method improves the performance of different baselines, especially on small-size classes, including Pedestrian and Cyclist.
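The Voxel Region idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the region is the axis-aligned bounding box of a voxel's projected points, that the points are already in the camera frame with positive depth, and that image features inside the region are aggregated by mean pooling. The function names (`project_points`, `voxel_region`, `gather_region_feature`) and the `stride` parameter are hypothetical.

```python
import numpy as np


def project_points(points_xyz, proj):
    """Project (N, 3) camera-frame points to (N, 2) pixels with a 3x4 matrix.

    Assumption: every point has positive depth (lies in front of the camera).
    """
    ones = np.ones((points_xyz.shape[0], 1))
    homo = np.hstack([points_xyz, ones])   # (N, 4) homogeneous coordinates
    uvw = homo @ proj.T                    # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]        # perspective divide


def voxel_region(points_xyz, proj, feat_hw, stride=1):
    """Return an inclusive box (u0, v0, u1, v1) on the feature map.

    Assumption: the region is the bounding box of the projected points,
    so its size follows the points in the voxel instead of a fixed grid.
    """
    uv = project_points(points_xyz, proj) / stride
    h, w = feat_hw
    u0, v0 = np.floor(uv.min(axis=0)).astype(int)
    u1, v1 = np.ceil(uv.max(axis=0)).astype(int)
    # Clip the box to the feature map bounds.
    u0, u1 = np.clip([u0, u1], 0, w - 1)
    v0, v1 = np.clip([v0, v1], 0, h - 1)
    return int(u0), int(v0), int(u1), int(v1)


def gather_region_feature(feat_map, region):
    """Mean-pool a (C, H, W) feature map over the region (an assumed choice)."""
    u0, v0, u1, v1 = region
    return feat_map[:, v0:v1 + 1, u0:u1 + 1].mean(axis=(1, 2))
```

Because the box is derived from the points actually present in each voxel, a voxel holding a few tightly clustered points yields a small region, which is one way to read the abstract's claim about avoiding background noise. Calling `voxel_region` on feature maps with different `stride` values would give a rough analogue of the multi-scale setting.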