*: Improving the 3D detector by introducing Voxel2Pillar feature encoding and extracting multi-scale features

Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Because object sizes differ significantly across categories, extracting rich multi-scale features is crucial for these detectors. However, because of real-time requirements, large convolution kernels are rarely used in the backbone to extract large-scale features. Current 3D detectors commonly use feature pyramid networks to obtain large-scale features; however, objects that contain few points lose further information during down-sampling, which degrades performance. Since pillar-based schemes require much less computation than voxel-based schemes, they are better suited to real-time 3D detectors. Hence, we propose the *, a pillar-based scheme, in which we redesign the feature encoding, the backbone, and the neck of the 3D detector. We propose the Voxel2Pillar feature encoding, which uses a sparse convolution constructor to build pillars with richer point cloud features, especially height features. Voxel2Pillar adds more learnable parameters to the feature encoding, giving the initial pillars stronger representation ability. The proposed fully sparse backbone extracts multi-scale and large-scale features without large convolution kernels; it is built from the proposed multi-scale feature extraction module. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves performance. We validate the effectiveness of the proposed * on the Waymo Open Dataset, where detection accuracy improves for vehicles, pedestrians, and cyclists. We also verify the effectiveness of each proposed module in detail through ablation studies.
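The idea of building pillars from voxels, rather than directly from raw points, can be illustrated with a minimal NumPy sketch. This is not the implementation described in the abstract: the function name `voxels_to_pillars`, the per-height-bin weight tensor, and the sum aggregation are all assumptions made for illustration, standing in for the sparse convolution constructor.

```python
import numpy as np


def voxels_to_pillars(coords, feats, weights):
    """Collapse sparse voxels into pillars along the height (z) axis.

    coords  : (N, 3) int array of occupied voxel indices, ordered (z, y, x)
    feats   : (N, C_in) float array of per-voxel features
    weights : (D, C_in, C_out) float array, one projection per height bin
              (hypothetical stand-in for a sparse convolution whose kernel
              spans the whole z axis)
    Returns pillar coordinates (M, 2) as (y, x) and features (M, C_out).
    """
    z = coords[:, 0]
    yx = coords[:, 1:]

    # Project each voxel with the weight matrix of its own height bin, so
    # the learnable parameters can treat different heights differently.
    projected = np.einsum("nc,nco->no", feats, weights[z])

    # Group voxels that share the same (y, x) cell into a single pillar.
    pillar_yx, inverse = np.unique(yx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions

    # Sum the projected voxel features inside each pillar.
    pillar_feats = np.zeros((len(pillar_yx), weights.shape[2]))
    np.add.at(pillar_feats, inverse, projected)
    return pillar_yx, pillar_feats
```

Only occupied voxels are stored and processed, which mirrors the sparsity that makes pillar-based schemes cheap. A plain max or mean over points in a pillar would discard the height ordering that the per-bin weights retain in this sketch.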
