*: Improving the 3D detector by introducing Voxel2Pillar feature encoding and extracting multi-scale features

Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Because object sizes differ significantly across categories, extracting rich multi-scale features is crucial for these detectors. However, because of real-time requirements, large convolution kernels are rarely used in the backbone to extract large-scale features. Current 3D detectors commonly use feature pyramid networks to obtain large-scale features; however, objects that contain only a few points lose further information during down-sampling, which degrades performance. Since pillar-based schemes require much less computation than voxel-based schemes, they are better suited to real-time 3D detectors. Hence, we propose the *, a pillar-based scheme, in which we redesign the feature encoding, the backbone, and the neck of the 3D detector. The proposed Voxel2Pillar feature encoding uses a sparse convolution constructor to build pillars with richer point cloud features, especially height features. Voxel2Pillar adds learnable parameters to the feature encoding, giving the initial pillars stronger representational capacity. The proposed fully sparse backbone extracts multi-scale and large-scale features without large convolution kernels; it is built from the proposed multi-scale feature extraction module. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves performance. We validate the proposed * on the Waymo Open Dataset, where detection accuracy improves for vehicles, pedestrians, and cyclists. Ablation studies verify the effectiveness of each proposed module in detail.
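To make the voxel-to-pillar idea concrete, the following is a minimal dense NumPy sketch, not the proposed sparse convolution constructor. It assumes a plain mean over the points in each voxel and a simple concatenation of height bins per pillar; the function names `voxelize` and `voxels_to_pillars`, the grid parameters, and the toy points are all hypothetical. It only illustrates how keeping per-height-bin features preserves height information that a single pooled pillar feature would discard.

```python
import numpy as np


def voxelize(points, voxel_size, pc_range):
    """Group points into occupied voxels.

    points: (N, C) array whose first three columns are x, y, z.
    Returns the integer coordinates (ix, iy, iz) of the occupied voxels
    and the mean point feature of each voxel.
    """
    mins = np.asarray(pc_range[:3], dtype=float)
    maxs = np.asarray(pc_range[3:], dtype=float)
    # Discard points outside the detection range.
    in_range = np.all((points[:, :3] >= mins) & (points[:, :3] < maxs), axis=1)
    pts = points[in_range]
    coords = np.floor((pts[:, :3] - mins) / np.asarray(voxel_size)).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    # Average the features of all points that fall into the same voxel.
    sums = np.zeros((len(uniq), pts.shape[1]))
    np.add.at(sums, inverse, pts)
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]


def voxels_to_pillars(coords, feats, num_z_bins):
    """Collapse the height axis of a sparse voxel set.

    Each occupied (ix, iy) column becomes one pillar whose feature is the
    concatenation of its voxel features over all height bins (zeros where
    a bin is empty). A learned layer could then mix these height bins.
    """
    xy, inverse = np.unique(coords[:, :2], axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    channels = feats.shape[1]
    pillars = np.zeros((len(xy), num_z_bins, channels))
    pillars[inverse, coords[:, 2]] = feats
    return xy, pillars.reshape(len(xy), num_z_bins * channels)


# Toy input: columns are x, y, z, intensity; the last point is out of range.
points = np.array([
    [0.1, 0.1, 0.1, 1.0],
    [0.2, 0.1, 0.1, 3.0],
    [0.1, 0.1, 1.5, 5.0],
    [1.5, 0.2, 0.3, 7.0],
    [9.0, 9.0, 9.0, 1.0],
])
voxel_coords, voxel_feats = voxelize(points, (1.0, 1.0, 1.0), (0, 0, 0, 2, 2, 2))
pillar_xy, pillar_feats = voxels_to_pillars(voxel_coords, voxel_feats, num_z_bins=2)
print(pillar_xy.shape, pillar_feats.shape)
```

In this sketch, the pillar at grid cell (0, 0) carries separate features for its lower and upper height bins, whereas a max- or mean-pooled pillar would merge them. In a real detector the dense arrays would be replaced by sparse tensors and the concatenation by learned sparse convolutions along the height axis.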
