PillarNeXt: Improving the 3D detector by introducing Voxel2Pillar feature encoding and extracting multi-scale features

Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Because object sizes differ significantly across classes, extracting rich multi-scale features is crucial for these detectors. However, real-time requirements mean that large convolution kernels are rarely used in the backbone to extract large-scale features. Current 3D detectors commonly rely on feature pyramid networks to obtain large-scale features, but objects covered by only a few points can be lost during downsampling, which degrades performance. Pillar-based schemes require much less computation than voxel-based schemes, so they are better suited to building real-time 3D detectors. We therefore propose PillarNeXt, a pillar-based scheme in which we redesign the feature encoding, the backbone, and the neck of the 3D detector. The proposed Voxel2Pillar feature encoding uses a sparse convolution constructor to build pillars with richer point cloud features, especially height features; it also adds learnable parameters, which enable the initial pillars to achieve higher performance. The proposed fully sparse backbone, which consists of the proposed multi-scale feature extraction module, extracts multi-scale and large-scale features without large convolution kernels. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves performance. We validate PillarNeXt on the Waymo Open Dataset, where detection accuracy improves for vehicles, pedestrians, and cyclists, and we verify the effectiveness of each proposed module in detail.

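To make the idea of building pillars from voxels more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a regular voxel grid and replaces the learned sparse convolution constructor with a plain per-height-bin point count, so that each non-empty pillar keeps a coarse height profile instead of collapsing all points into one cell. The function name `voxels_to_pillars` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def voxels_to_pillars(points, grid_min, voxel_size, grid_shape):
    """Group points into voxels, then collapse the z axis into pillars.

    Illustrative assumption: the pillar feature is the number of points
    in each height bin. A learned encoder would replace this counting.

    points:     (N, 3) array of x, y, z coordinates.
    grid_min:   (3,) lower corner of the detection range.
    voxel_size: (3,) edge lengths of one voxel.
    grid_shape: (nx, ny, nz) number of voxels along each axis.
    Returns a dict mapping (ix, iy) to an (nz,) vector of point counts.
    """
    points = np.asarray(points, dtype=np.float64)
    idx = np.floor((points - grid_min) / voxel_size).astype(np.int64)

    # Keep only points that fall inside the grid.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[inside]

    # Store only non-empty pillars, mirroring a sparse representation.
    pillars = {}
    for ix, iy, iz in idx:
        feat = pillars.setdefault((int(ix), int(iy)), np.zeros(grid_shape[2]))
        feat[iz] += 1.0
    return pillars


# Toy point cloud: three points inside a 2 x 2 x 2 grid, one outside it.
pts = np.array([
    [0.1, 0.1, 0.2],
    [0.2, 0.3, 1.5],
    [1.2, 0.1, 0.4],
    [9.0, 9.0, 9.0],
])
pillars = voxels_to_pillars(
    pts,
    grid_min=np.zeros(3),
    voxel_size=np.array([1.0, 1.0, 1.0]),
    grid_shape=(2, 2, 2),
)
```

In this toy example the pillar at `(0, 0)` holds one point in each of its two height bins, while the pillar at `(1, 0)` holds a single point in the lower bin. A pillar encoding that only pooled over all points in a column would not distinguish these height distributions, which is the kind of information the abstract describes as height features.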