Efficient point cloud representation is fundamental to LiDAR-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in Bird's Eye View. However, these encoding paradigms underestimate the point representation in the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects such as pedestrians and cyclists. In this paper, we propose VPFusion, an explicit vertical multi-scale representation learning framework that combines the complementary information from the voxel and pillar streams. Specifically, VPFusion builds upon a sparse voxel-pillar backbone, which divides point clouds into voxels and pillars and encodes features with 3D and 2D sparse convolutions simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway between sparse voxel and pillar features so that the two can interact. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the dense feature maps from the voxel and pillar branches at multiple scales. Extensive experiments on the large-scale Waymo Open Dataset and the nuScenes dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
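To make the voxel and pillar partitioning and the idea of a bidirectional voxel-pillar pathway concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the function names, the max-pooling along the vertical axis, and the broadcast back to voxels are illustrative assumptions, and the real SFL operates on learned sparse convolutional features rather than raw arrays.

```python
import numpy as np


def voxelize_and_pillarize(points, voxel_size, pc_min):
    """Assign points to voxels and group those voxels into pillars.

    Hypothetical helper for illustration only.
    points: (N, 3+) array, voxel_size: (3,) array, pc_min: (3,) array.
    Returns the occupied voxel coordinates (V, 3), the occupied pillar
    coordinates (P, 2), and for each voxel the index of its pillar (V,).
    """
    # Integer (ix, iy, iz) cell coordinates of every point.
    idx = np.floor((points[:, :3] - pc_min) / voxel_size).astype(np.int64)
    # Occupied voxels are the unique 3D cells.
    voxels = np.unique(idx, axis=0)
    # A pillar is a voxel column: drop the vertical (z) coordinate.
    pillars, voxel_to_pillar = np.unique(
        voxels[:, :2], axis=0, return_inverse=True
    )
    return voxels, pillars, voxel_to_pillar.reshape(-1)


def voxels_to_pillars(voxel_feat, voxel_to_pillar, num_pillars):
    """Voxel-to-pillar direction: max-pool voxel features per column (assumed)."""
    voxel_feat = np.asarray(voxel_feat, dtype=np.float64)
    pooled = np.full((num_pillars, voxel_feat.shape[1]), -np.inf)
    # Unbuffered element-wise maximum, so repeated pillar indices accumulate.
    np.maximum.at(pooled, voxel_to_pillar, voxel_feat)
    return pooled


def pillars_to_voxels(pillar_feat, voxel_to_pillar):
    """Pillar-to-voxel direction: copy each pillar feature to its voxels (assumed)."""
    return pillar_feat[voxel_to_pillar]


if __name__ == "__main__":
    pts = np.array([
        [0.10, 0.10, 0.10],
        [0.10, 0.10, 1.50],  # same column as the first point, one cell higher
        [2.20, 0.30, 0.20],
        [0.15, 0.12, 0.12],  # falls in the same voxel as the first point
    ])
    voxels, pillars, v2p = voxelize_and_pillarize(
        pts, voxel_size=np.array([1.0, 1.0, 1.0]), pc_min=np.zeros(3)
    )
    voxel_feat = np.array([[1.0], [3.0], [2.0]])  # toy one-channel features
    pillar_feat = voxels_to_pillars(voxel_feat, v2p, len(pillars))
    print(voxels.shape, pillars.shape, v2p)
    print(pillars_to_voxels(pillar_feat, v2p))
```

In this toy case the two stacked voxels share one pillar, so the pillar sees the stronger of the two vertical responses while each voxel can still receive column-level context back. This is the kind of complementary information the two streams are meant to exchange.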